Machine Learning for Business Analytics

Published: February 25, 2026


Welcome! These notes assume no prior Python experience. Each module builds on the previous one. Work through the tasks in every section — hands-on practice is the fastest path to understanding.


1 Module 1: Python Programming Fundamentals

Python is one of the most popular programming languages for data analysis and machine learning. It reads almost like plain English, which makes it an excellent first language for business students.

1.1 Section 1.1 — Python Basics and Conditional Statements

1.1.1 Variables and Data Types

A variable is a named container that holds a value. Python automatically detects the type of data you store.

Code
# Assign values to variables
company = "Acme Corp"          # str  — text
revenue = 4_500_000            # int  — whole number
profit_margin = 0.18           # float — decimal number
is_profitable = True           # bool — True or False

# Print to screen
print("Company:", company)
print("Revenue: $", revenue)
print("Profit margin:", profit_margin)
print("Profitable?", is_profitable)
Company: Acme Corp
Revenue: $ 4500000
Profit margin: 0.18
Profitable? True
Code
# Check the type of a variable
print(type(revenue))
print(type(profit_margin))
print(type(company))
<class 'int'>
<class 'float'>
<class 'str'>

1.1.2 Basic Arithmetic

Code
price = 250        # unit price in dollars
units_sold = 1200  # units sold this quarter

total_sales = price * units_sold
discount = total_sales * 0.05          # 5% discount
net_sales  = total_sales - discount

print(f"Total Sales : ${total_sales:,}")
print(f"Discount    : ${discount:,.2f}")
print(f"Net Sales   : ${net_sales:,.2f}")
Total Sales : $300,000
Discount    : $15,000.00
Net Sales   : $285,000.00

f-strings (formatted string literals) let you embed variable values directly inside a string using {variable_name}. They are the recommended way to format output in modern Python.
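Format specifiers after a colon inside the braces control how a value is rendered. A quick sketch (the values are made up for illustration):

```python
# f-string format specifiers: thousands separators and percentages
revenue = 4_500_000
margin  = 0.18

# :,   — insert thousands separators
# :.1% — multiply by 100, show 1 decimal place, append %
summary = f"Revenue: ${revenue:,} | Margin: {margin:.1%}"
print(summary)  # Revenue: $4,500,000 | Margin: 18.0%
```

The same specifiers (`:,`, `:.2f`, `:>2`) appear throughout the code in these notes.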

1.1.3 Conditional Statements

Conditional statements let your program make decisions.

Code
# if / elif / else
quarterly_profit = 85_000

if quarterly_profit > 100_000:
    print("Outstanding quarter — bonus approved.")
elif quarterly_profit > 50_000:
    print("Good quarter — on target.")
elif quarterly_profit > 0:
    print("Marginal quarter — review costs.")
else:
    print("Loss this quarter — action required.")
Good quarter — on target.
Code
# Combining conditions with 'and' / 'or'
customer_age  = 35
account_value = 120_000

if customer_age >= 30 and account_value >= 100_000:
    print("Eligible for premium wealth management services.")
else:
    print("Standard account services apply.")
Eligible for premium wealth management services.
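Alongside and / or, Python has a not operator that inverts a Boolean (it appears in the evaluation questions below). A minimal sketch:

```python
# 'not' flips a Boolean value
is_profitable = True
needs_review  = not is_profitable
print(needs_review)  # False

# Common use: testing a negative condition directly
account_frozen = False
if not account_frozen:
    print("Transactions allowed.")
```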

1.1.4 Comparison Operators

Operator   Meaning            Example
==         Equal to           x == 10
!=         Not equal to       x != 10
>          Greater than       sales > 1000
<          Less than          cost < budget
>=         Greater or equal   age >= 18
<=         Less or equal      risk <= 0.05
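Every comparison evaluates to a Boolean, which you can store in a variable just like any other value. A small sketch with illustrative numbers:

```python
# Comparison expressions evaluate to True or False
sales  = 1_200
budget = 1_000

over_budget = sales > budget     # True
exact_match = sales == budget    # False
print(over_budget, exact_match)
```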

1.1.5 Student Task 1.1

A retail store applies the following discount policy:

  • Purchase ≥ $500 → 10% discount
  • Purchase ≥ $200 and < $500 → 5% discount
  • Purchase < $200 → no discount

Write a Python program that:

  1. Stores a purchase amount in a variable.
  2. Uses if / elif / else to determine the discount rate.
  3. Calculates and prints the final price after discount.
  4. Runs correctly for at least three different purchase amounts (test by changing the value).
Code
# Your code here
purchase_amount = 350   # change this value to test

# Write your conditional logic below

1.1.6 Evaluation Questions 1.1

  1. What is the output of print(type(3.14))?
    1. <class 'int'>
    2. <class 'float'> (correct)
    3. <class 'str'>
    4. <class 'bool'>
  2. Which operator checks whether two values are equal in Python?
    1. =
    2. ===
    3. == (correct)
    4. !=
  3. In the code if x > 10 and y < 5:, the block executes when:
    1. Either condition is true
    2. Both conditions are true (correct)
    3. Neither condition is true
    4. Only the first condition is true
  4. What does an f-string do?
    1. Forces Python to use floating-point arithmetic
    2. Filters a string for special characters
    3. Embeds variable values inside a string literal (correct)
    4. Formats a file for output
  5. What value does is_profitable = not True store?
    1. True
    2. None
    3. 0
    4. False (correct)

1.2 Section 1.2 — Loops in Python

Loops allow you to repeat actions without rewriting the same code. This is essential when processing large datasets.

1.2.1 The for Loop

Code
# Iterate over a list of items
products = ["Laptop", "Tablet", "Smartphone", "Monitor"]

for product in products:
    print(f"Processing inventory for: {product}")
Processing inventory for: Laptop
Processing inventory for: Tablet
Processing inventory for: Smartphone
Processing inventory for: Monitor
Code
# range() generates a sequence of numbers
# range(start, stop, step)  — stop is exclusive
print("Sales Report — Q1 Weeks")
for week in range(1, 13):          # weeks 1 through 12
    weekly_target = 50_000
    print(f"  Week {week:>2}: Target = ${weekly_target:,}")
Sales Report — Q1 Weeks
  Week  1: Target = $50,000
  Week  2: Target = $50,000
  Week  3: Target = $50,000
  Week  4: Target = $50,000
  Week  5: Target = $50,000
  Week  6: Target = $50,000
  Week  7: Target = $50,000
  Week  8: Target = $50,000
  Week  9: Target = $50,000
  Week 10: Target = $50,000
  Week 11: Target = $50,000
  Week 12: Target = $50,000
Code
# Accumulate a running total
sales_data = [12_000, 18_500, 9_300, 22_100, 15_600]
total = 0

for sale in sales_data:
    total += sale              # shorthand for total = total + sale

print(f"Total Sales: ${total:,}")
print(f"Average Sale: ${total / len(sales_data):,.2f}")
Total Sales: $77,500
Average Sale: $15,500.00

1.2.2 The while Loop

A while loop runs as long as a condition remains True.

Code
# Simulate compounding interest until a target is reached
balance  = 10_000    # initial investment
rate     = 0.07      # 7% annual return
target   = 20_000
years    = 0

while balance < target:
    balance *= (1 + rate)
    years   += 1

print(f"Investment doubles in {years} years.")
print(f"Final balance: ${balance:,.2f}")
Investment doubles in 11 years.
Final balance: $21,048.52

1.2.3 Loop Control: break and continue

Code
# break — exit the loop early
sales_figures = [8_200, 11_500, 6_800, -500, 14_200, 9_900]

print("Validating sales records:")
for i, sale in enumerate(sales_figures):   # enumerate pairs each index with its item
    if sale < 0:
        print(f"  ERROR: Negative sale at record {i} — stopping validation.")
        break
    print(f"  Record {i}: ${sale:,} — OK")
Validating sales records:
  Record 0: $8,200 — OK
  Record 1: $11,500 — OK
  Record 2: $6,800 — OK
  ERROR: Negative sale at record 3 — stopping validation.
Code
# continue — skip the current iteration
transactions = [200, -50, 450, -30, 1200, 80]

print("Positive transactions only:")
for t in transactions:
    if t < 0:
        continue                      # skip negative entries
    print(f"  ${t:,}")
Positive transactions only:
  $200
  $450
  $1,200
  $80

1.2.4 List Comprehensions (Compact Loops)

Code
prices = [100, 250, 75, 400, 180]

# Traditional loop
discounted_traditional = []
for p in prices:
    discounted_traditional.append(p * 0.9)

# List comprehension — same result, one line
discounted = [p * 0.9 for p in prices]

print("Original prices:", prices)
print("Discounted (10%):", discounted)
Original prices: [100, 250, 75, 400, 180]
Discounted (10%): [90.0, 225.0, 67.5, 360.0, 162.0]
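A comprehension can also filter with an if clause at the end. A sketch reusing the same price list, keeping only items above $150:

```python
prices = [100, 250, 75, 400, 180]

# Filter and transform in one expression:
# only prices above $150 are kept, each discounted by 10%
premium_discounted = [p * 0.9 for p in prices if p > 150]
print(premium_discounted)  # [225.0, 360.0, 162.0]
```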

1.2.5 Student Task 1.2

Your company recorded daily website visitors for two weeks:

[1_250, 980, 1_430, 2_100, 1_890, 760, 430,
 1_320, 1_050, 1_780, 2_250, 1_970, 810, 510]

Using loops, calculate and print:

  1. The total number of visitors over the two weeks.
  2. The average daily visitors (rounded to the nearest whole number).
  3. The number of days with more than 1,500 visitors.
  4. The highest and lowest single-day visitor counts.
Code
# Your code here
daily_visitors = [1_250, 980, 1_430, 2_100, 1_890, 760, 430,
                  1_320, 1_050, 1_780, 2_250, 1_970, 810, 510]

1.2.6 Evaluation Questions 1.2

  1. What does list(range(2, 10, 2)) produce?
    1. [2, 4, 6, 8] (correct)
    2. [2, 4, 6, 8, 10]
    3. [1, 3, 5, 7, 9]
    4. [2, 10, 2]
  2. The statement total += sale is equivalent to:
    1. total = sale
    2. total = total - sale
    3. total = total * sale
    4. total = total + sale (correct)
  3. Which statement immediately exits a loop?
    1. exit
    2. continue
    3. break (correct)
    4. stop
  4. A while loop is best used when:
    1. You need to iterate over a fixed list
    2. The number of iterations depends on a condition (correct)
    3. You always need exactly 10 iterations
    4. You want to iterate over a dictionary
  5. What is the output of [x**2 for x in range(1, 4)]?
    1. [1, 4, 9] (correct)
    2. [1, 2, 3]
    3. [2, 4, 6]
    4. [1, 8, 27]

1.3 Section 1.3 — Lists, Dictionaries, and Tuples

Python’s built-in data structures let you organise and manipulate collections of data — a critical skill before working with datasets.

1.3.1 Lists

A list is an ordered, mutable (changeable) sequence.

Code
# Create and access a list
sales_regions = ["North", "South", "East", "West", "Central"]

print("First region:", sales_regions[0])       # index starts at 0
print("Last region:", sales_regions[-1])        # -1 = last item
print("Regions 1–2:", sales_regions[1:3])       # slicing — stop index is exclusive
First region: North
Last region: Central
Regions 1–2: ['South', 'East']
Code
# Modify a list
quarterly_sales = [120_000, 145_000, 98_000, 162_000]

quarterly_sales.append(175_000)         # add to end
quarterly_sales.insert(0, 110_000)      # insert at position 0
quarterly_sales.remove(98_000)          # remove by value

print("Updated sales:", quarterly_sales)
print("Total periods:", len(quarterly_sales))
print(f"Max quarter: ${max(quarterly_sales):,}")
Updated sales: [110000, 120000, 145000, 162000, 175000]
Total periods: 5
Max quarter: $175,000
Code
# Sorting lists
scores = [88, 72, 95, 61, 84, 99, 77]
scores_sorted = sorted(scores, reverse=True)   # high to low
print("Ranked scores:", scores_sorted)
Ranked scores: [99, 95, 88, 84, 77, 72, 61]

1.3.2 Dictionaries

A dictionary maps keys to values — ideal for structured records.

Code
# Create a dictionary for a customer record
customer = {
    "id"          : "C-10482",
    "name"        : "GlobalTech Ltd",
    "industry"    : "Technology",
    "annual_spend": 285_000,
    "active"      : True
}

# Access values by key
print("Customer:", customer["name"])
print("Industry:", customer["industry"])
print(f"Spend:   ${customer['annual_spend']:,}")
Customer: GlobalTech Ltd
Industry: Technology
Spend:   $285,000
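Accessing a missing key with square brackets raises a KeyError. When a key may be absent, dict.get is safer — a sketch with a trimmed-down record:

```python
customer = {"name": "GlobalTech Ltd", "industry": "Technology"}

# .get returns None (or a supplied default) instead of raising KeyError
print(customer.get("region"))             # None
print(customer.get("region", "Unknown"))  # Unknown
```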
Code
# Update, add, and delete entries
customer["annual_spend"] = 310_000          # update
customer["account_manager"] = "Sarah Lee"   # add new key
del customer["id"]                          # remove key

print(customer)
{'name': 'GlobalTech Ltd', 'industry': 'Technology', 'annual_spend': 310000, 'active': True, 'account_manager': 'Sarah Lee'}
Code
# Iterate over a dictionary
product_inventory = {
    "Laptop"     : 45,
    "Tablet"     : 120,
    "Smartphone" : 89,
    "Monitor"    : 32
}

print("Current Inventory:")
for product, qty in product_inventory.items():
    status = "LOW STOCK" if qty < 40 else "OK"
    print(f"  {product:<12}: {qty:>4} units  [{status}]")
Current Inventory:
  Laptop      :   45 units  [OK]
  Tablet      :  120 units  [OK]
  Smartphone  :   89 units  [OK]
  Monitor     :   32 units  [LOW STOCK]

1.3.3 Tuples

A tuple is like a list but immutable (cannot be changed after creation). Use tuples for fixed data such as coordinates, RGB colours, or database records.

Code
# Tuple examples
location      = (40.7128, -74.0060)        # New York lat/lon
fiscal_year   = (2024, "Q4", "USD")
rgb_brand     = (0, 102, 204)              # company brand colour

print("Headquarters:", location)
print("Fiscal period:", fiscal_year)

# Unpack a tuple into variables
lat, lon = location
print(f"Latitude: {lat}, Longitude: {lon}")
Headquarters: (40.7128, -74.006)
Fiscal period: (2024, 'Q4', 'USD')
Latitude: 40.7128, Longitude: -74.006
Code
# List of tuples — useful for tabular data
transactions = [
    ("2024-01-05", "Invoice #1001", 15_200),
    ("2024-01-12", "Invoice #1002",  8_750),
    ("2024-01-20", "Invoice #1003", 22_400),
]

print(f"{'Date':<12} {'Reference':<18} {'Amount':>10}")
print("-" * 42)
for date, ref, amount in transactions:
    print(f"{date:<12} {ref:<18} ${amount:>9,}")
Date         Reference              Amount
------------------------------------------
2024-01-05   Invoice #1001      $   15,200
2024-01-12   Invoice #1002      $    8,750
2024-01-20   Invoice #1003      $   22,400

1.3.4 Nested Structures

Code
# A list of dictionaries — mimics a simple database table
employees = [
    {"name": "Alice",   "dept": "Sales",   "salary": 72_000},
    {"name": "Bob",     "dept": "Finance", "salary": 85_000},
    {"name": "Carol",   "dept": "Sales",   "salary": 69_000},
    {"name": "David",   "dept": "IT",      "salary": 92_000},
]

# Filter: Sales department only
sales_team = [e for e in employees if e["dept"] == "Sales"]
avg_sales_salary = sum(e["salary"] for e in sales_team) / len(sales_team)

print(f"Average Sales Salary: ${avg_sales_salary:,.2f}")
Average Sales Salary: $70,500.00
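Finding an extreme record in a list of dictionaries is another common pattern (and useful for the student task below). A sketch using max with a key function:

```python
employees = [
    {"name": "Alice", "dept": "Sales",   "salary": 72_000},
    {"name": "Bob",   "dept": "Finance", "salary": 85_000},
    {"name": "David", "dept": "IT",      "salary": 92_000},
]

# key= tells max which value to compare for each record
top_earner = max(employees, key=lambda e: e["salary"])
print(top_earner["name"])  # David
```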

1.3.5 Student Task 1.3

You are given the following customer data as a list of dictionaries:

customers = [
    {"name": "Apex Corp",    "region": "East",  "purchases": 34_000},
    {"name": "BlueSky LLC",  "region": "West",  "purchases": 87_500},
    {"name": "CoreTech",     "region": "East",  "purchases": 12_200},
    {"name": "Delta Group",  "region": "West",  "purchases": 56_000},
    {"name": "Edge Systems", "region": "East",  "purchases": 29_800},
]

Write code to:

  1. Print the name and purchase amount of all East region customers.
  2. Calculate and print the total purchases for the West region.
  3. Add a new customer {"name": "Fusion Inc", "region": "North", "purchases": 44_000} to the list.
  4. Find and print the name of the customer with the highest total purchases.
Code
# Your code here
customers = [
    {"name": "Apex Corp",    "region": "East",  "purchases": 34_000},
    {"name": "BlueSky LLC",  "region": "West",  "purchases": 87_500},
    {"name": "CoreTech",     "region": "East",  "purchases": 12_200},
    {"name": "Delta Group",  "region": "West",  "purchases": 56_000},
    {"name": "Edge Systems", "region": "East",  "purchases": 29_800},
]

1.3.6 Evaluation Questions 1.3

  1. What is the index of the first element in a Python list?
    1. 1
    2. -1
    3. 0 (correct)
    4. None
  2. Which method adds an item to the end of a list?
    1. insert()
    2. append() (correct)
    3. add()
    4. push()
  3. What distinguishes a tuple from a list?
    1. Tuples use curly braces
    2. Tuples are faster to print
    3. Tuples cannot be changed after creation (correct)
    4. Tuples can only hold numbers
  4. How do you access the value for key "salary" in a dictionary emp?
    1. emp.salary
    2. emp["salary"] (correct)
    3. emp{salary}
    4. emp->salary
  5. Which expression creates a list of even numbers from 2 to 10?
    1. [x for x in range(1, 10) if x % 2 != 0]
    2. [x for x in range(2, 11, 2)] (correct)
    3. [x for x in range(0, 10)]
    4. [x for x in range(2, 10, 3)]

1.4 Section 1.4 — Introduction to NumPy and Pandas

NumPy and Pandas are the two foundational libraries for data work in Python. NumPy provides fast numerical arrays; Pandas provides spreadsheet-like tables called DataFrames.

1.4.1 NumPy Basics

Code
import numpy as np

# Create arrays
prices   = np.array([199, 299, 149, 399, 249])
units    = np.array([120,  85, 200,  60, 140])

# Element-wise operations (no loop needed!)
revenue  = prices * units
print("Revenue per product:", revenue)
print(f"Total revenue:  ${revenue.sum():,}")
print(f"Average revenue: ${revenue.mean():,.2f}")
print(f"Std deviation:   ${revenue.std():,.2f}")
Revenue per product: [23880 25415 29800 23940 34860]
Total revenue:  $137,895
Average revenue: $27,579.00
Std deviation:   $4,232.11
Code
# 2-D array — think of it as a matrix / mini-table
# Rows: products, Columns: Q1, Q2, Q3, Q4
sales_matrix = np.array([
    [12_000, 15_000, 11_000, 18_000],
    [ 9_500, 10_200,  8_900, 12_500],
    [22_000, 24_500, 20_000, 27_000],
])

print("Sales matrix shape:", sales_matrix.shape)       # (3, 4)
print("Annual totals per product:", sales_matrix.sum(axis=1))
print("Quarterly totals:         ", sales_matrix.sum(axis=0))
Sales matrix shape: (3, 4)
Annual totals per product: [56000 41100 93500]
Quarterly totals:          [43500 49700 39900 57500]

1.4.2 Pandas Basics

Code
import pandas as pd

# Create a DataFrame from a dictionary — like an Excel table in Python
data = {
    "Product"    : ["Laptop", "Tablet", "Smartphone", "Monitor", "Keyboard"],
    "Category"   : ["Electronics", "Electronics", "Electronics", "Electronics", "Accessories"],
    "Price"      : [999, 499, 799, 349, 89],
    "Units_Sold" : [120,  85, 200,  60, 310],
    "Rating"     : [4.5, 4.2, 4.7, 4.0, 4.3],
}

df = pd.DataFrame(data)
print(df)
      Product     Category  Price  Units_Sold  Rating
0      Laptop  Electronics    999         120     4.5
1      Tablet  Electronics    499          85     4.2
2  Smartphone  Electronics    799         200     4.7
3     Monitor  Electronics    349          60     4.0
4    Keyboard  Accessories     89         310     4.3
Code
# Basic DataFrame inspection
print("Shape:", df.shape)                # (rows, columns)
print("\nData types:\n", df.dtypes)
print("\nSummary statistics:")
print(df.describe())
Shape: (5, 5)

Data types:
 Product        object
Category       object
Price           int64
Units_Sold      int64
Rating        float64
dtype: object

Summary statistics:
            Price  Units_Sold    Rating
count    5.000000    5.000000  5.000000
mean   547.000000  155.000000  4.340000
std    360.236034  101.488916  0.270185
min     89.000000   60.000000  4.000000
25%    349.000000   85.000000  4.200000
50%    499.000000  120.000000  4.300000
75%    799.000000  200.000000  4.500000
max    999.000000  310.000000  4.700000
Code
# Computed columns
df["Revenue"] = df["Price"] * df["Units_Sold"]
df["Revenue_Share"] = (df["Revenue"] / df["Revenue"].sum() * 100).round(1)

print(df[["Product", "Revenue", "Revenue_Share"]])
      Product  Revenue  Revenue_Share
0      Laptop   119880           32.3
1      Tablet    42415           11.4
2  Smartphone   159800           43.1
3     Monitor    20940            5.6
4    Keyboard    27590            7.4
Code
# Filtering rows
high_rating = df[df["Rating"] >= 4.5]
print("\nTop-rated products:")
print(high_rating[["Product", "Rating", "Revenue"]])

Top-rated products:
      Product  Rating  Revenue
0      Laptop     4.5   119880
2  Smartphone     4.7   159800
Code
# Sorting
top_revenue = df.sort_values("Revenue", ascending=False)
print("\nProducts ranked by revenue:")
print(top_revenue[["Product", "Revenue"]].to_string(index=False))

Products ranked by revenue:
   Product  Revenue
Smartphone   159800
    Laptop   119880
    Tablet    42415
  Keyboard    27590
   Monitor    20940
Code
# Grouping and aggregation
category_summary = df.groupby("Category").agg(
    Total_Revenue  = ("Revenue", "sum"),
    Avg_Rating     = ("Rating",  "mean"),
    Product_Count  = ("Product", "count")
).reset_index()

print("\nCategory Summary:")
print(category_summary)

Category Summary:
      Category  Total_Revenue  Avg_Rating  Product_Count
0  Accessories          27590        4.30              1
1  Electronics         343035        4.35              4

1.4.3 Reading Data from Files

In practice, data arrives as CSV files or Excel spreadsheets.

Code
# Reading a CSV file (not run — example only)
df_sales = pd.read_csv("sales_data.csv")

# Reading an Excel file
df_sales = pd.read_excel("sales_data.xlsx", sheet_name="Q1")

# Quick look at the first 5 rows
df_sales.head()
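Writing data out works the same way in reverse. A round-trip sketch (the file name here is illustrative, not part of the course materials):

```python
import pandas as pd

df = pd.DataFrame({"Product": ["Laptop", "Tablet"], "Units": [120, 85]})

# Write to CSV; index=False omits the row-number column from the file
df.to_csv("inventory_export.csv", index=False)

# Read it back to confirm the round trip preserved the data
df_check = pd.read_csv("inventory_export.csv")
print(df_check)
```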

1.4.4 Student Task 1.4

Create a Pandas DataFrame representing five employees with the following columns: Name, Department, Years_Experience, Salary.

Then write code to:

  1. Display basic summary statistics for numeric columns.
  2. Add a column Salary_Grade whose value is "Senior" if Years_Experience >= 5, else "Junior".
  3. Filter and display only Senior employees.
  4. Calculate the average salary by department.
  5. Sort and display all employees from highest to lowest salary.
Code
# Your code here
import pandas as pd

# Create your employee DataFrame here

1.4.5 Evaluation Questions 1.4

  1. Which NumPy method calculates the mean of an array?
    1. np.total()
    2. np.mean() (correct)
    3. np.avg()
    4. np.center()
  2. What does df.shape return for a DataFrame with 100 rows and 5 columns?
    1. [100, 5]
    2. 100 x 5
    3. (100, 5) (correct)
    4. (5, 100)
  3. Which method shows the first 5 rows of a DataFrame?
    1. df.top()
    2. df.first()
    3. df.show()
    4. df.head() (correct)
  4. df[df["Sales"] > 10000] is an example of:
    1. Sorting a DataFrame
    2. Filtering rows based on a condition (correct)
    3. Deleting rows with values over 10,000
    4. Replacing values over 10,000
  5. What does df.groupby("Region").agg({"Sales": "sum"}) produce?
    1. Individual rows where Region equals “Sales”
    2. Total sales for each region (correct)
    3. Average sales across all regions
    4. A sorted list of regions

2 Module 2: Exploratory Data Analysis (EDA)

Before building any model, you must understand your data. EDA is the process of examining datasets to summarise their main characteristics, spot problems, and uncover patterns.

2.1 Section 2.1 — Handling Missing Data

Real-world business data is almost always incomplete. Learning how to detect and handle missing values is a fundamental skill.

Code
import pandas as pd
import numpy as np

# Simulate a customer dataset with missing values
np.random.seed(42)

n = 200
data = {
    "CustomerID"  : range(1001, 1001 + n),
    "Age"         : np.where(np.random.rand(n) < 0.08, np.nan,
                             np.random.randint(22, 70, n).astype(float)),
    "Income"      : np.where(np.random.rand(n) < 0.12, np.nan,
                             np.random.normal(65_000, 20_000, n).round(-2)),
    "Purchases"   : np.random.randint(1, 50, n),
    "Segment"     : np.where(np.random.rand(n) < 0.05, np.nan,
                             np.random.choice(["Bronze","Silver","Gold"], n)),
}

df = pd.DataFrame(data)
print("Dataset shape:", df.shape)
print(df.head())
Dataset shape: (200, 5)
   CustomerID   Age   Income  Purchases Segment
0        1001  45.0  79200.0         47    Gold
1        1002  32.0  42500.0         26  Bronze
2        1003  29.0  34300.0         46  Silver
3        1004  57.0  90600.0         43  Silver
4        1005  59.0  71600.0         12  Bronze

2.1.1 Detecting Missing Values

Code
# Count missing values per column
missing = df.isnull().sum()
pct_missing = (missing / len(df) * 100).round(1)

missing_report = pd.DataFrame({
    "Missing_Count"  : missing,
    "Missing_Pct_%"  : pct_missing
})
print(missing_report[missing_report["Missing_Count"] > 0])
        Missing_Count  Missing_Pct_%
Age                18            9.0
Income             35           17.5
Code
# Visualise missingness pattern
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(8, 3))
ax.bar(missing_report.index, missing_report["Missing_Pct_%"], color="steelblue")
ax.set_title("Percentage of Missing Values per Column")
ax.set_ylabel("Missing (%)")
ax.set_xlabel("Column")
plt.tight_layout()
plt.show()

2.1.2 Strategies for Handling Missing Data

Strategy                When to Use
Drop rows               Very few rows affected and data is large
Fill with mean/median   Numerical columns, missing at random
Fill with mode          Categorical columns
Forward/backward fill   Time-series data
Predictive imputation   Advanced; missing not at random
Code
df_clean = df.copy()

# 1. Fill numeric columns with median (robust to outliers)
df_clean["Age"]    = df_clean["Age"].fillna(df_clean["Age"].median())
df_clean["Income"] = df_clean["Income"].fillna(df_clean["Income"].median())

# 2. Fill categorical column with mode (most frequent value)
df_clean["Segment"] = df_clean["Segment"].fillna(df_clean["Segment"].mode()[0])

# Verify no missing values remain
print("Missing after cleaning:", df_clean.isnull().sum().sum())
print("\nMedian Age used for imputation:", df["Age"].median())
print(f"Median Income used:            ${df['Income'].median():,.0f}")
Missing after cleaning: 0

Median Age used for imputation: 47.5
Median Income used:            $64,000
Code
# Alternative: drop rows with any missing values (use when data is abundant)
df_dropped = df.dropna()
print(f"Rows before: {len(df)}, after dropping NAs: {len(df_dropped)}")
Rows before: 200, after dropping NAs: 152
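The forward-fill strategy from the table above was not demonstrated. A minimal sketch on a hypothetical daily price series, where carrying the last known value forward is usually more sensible than a global average:

```python
import pandas as pd
import numpy as np

daily_price = pd.Series([101.5, np.nan, np.nan, 103.2, np.nan, 104.0])

# ffill propagates the last observed value forward in time
filled = daily_price.ffill()
print(filled.tolist())  # [101.5, 101.5, 101.5, 103.2, 103.2, 104.0]
```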

2.1.3 Student Task 2.1

Run the cell below to create a sales dataset with missing values. Then:

  1. Report which columns have missing values and what percentage is missing.
  2. Choose an appropriate strategy for each column and justify your choice.
  3. Apply your chosen strategy to produce a clean dataset df_sales_clean.
  4. Verify the clean dataset has zero missing values.
Code
# Dataset provided — do not change this cell
np.random.seed(7)
m = 150
df_sales = pd.DataFrame({
    "OrderID"    : range(5001, 5001 + m),
    "Region"     : np.where(np.random.rand(m) < 0.06, np.nan,
                            np.random.choice(["North","South","East","West"], m)),
    "Sales"      : np.where(np.random.rand(m) < 0.10, np.nan,
                            np.random.uniform(500, 50_000, m).round(2)),
    "Quantity"   : np.random.randint(1, 100, m),
    "Discount"   : np.where(np.random.rand(m) < 0.15, np.nan,
                            np.random.uniform(0, 0.4, m).round(2)),
})

# Your cleaning code here

2.1.4 Evaluation Questions 2.1

  1. Which method returns a Boolean DataFrame showing where values are missing?
    1. df.missing()
    2. df.isna() (correct)
    3. df.nullcheck()
    4. df.find_nan()
  2. When is replacing missing values with the median preferred over the mean?
    1. When there are no outliers
    2. When the data is perfectly symmetric
    3. When outliers are present in the column (correct)
    4. When the column contains text
  3. Filling missing values using values from the previous row is called:
    1. Backward fill
    2. Mean imputation
    3. Forward fill (correct)
    4. Random imputation
  4. If 40% of values in a column are missing, which action is most appropriate?
    1. Fill with the mean — always safe
    2. Drop all rows with missing values
    3. Investigate why data is missing and consider dropping the column (correct)
    4. Replace with zero
  5. df.dropna() removes:
    1. Columns with missing values
    2. Only rows where all values are NaN
    3. Any row that contains at least one missing value (correct)
    4. Zero values

2.2 Section 2.2 — Scaling and Normalising Data

Machine learning algorithms are sensitive to the scale of your features. A salary column (range: 30,000–200,000) would dominate an age column (range: 20–65) unless we rescale them.

Code
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Sample employee dataset
np.random.seed(0)
n = 300
df_emp = pd.DataFrame({
    "Age"        : np.random.randint(22, 62, n),
    "Salary"     : np.random.normal(75_000, 20_000, n).clip(30_000, 150_000),
    "Experience" : np.random.randint(0, 35, n),
    "Score"      : np.random.uniform(50, 100, n).round(1),
})

print("Raw data statistics:")
print(df_emp.describe().round(2))
Raw data statistics:
          Age     Salary  Experience   Score
count  300.00     300.00      300.00  300.00
mean    41.38   72839.86       16.83   74.91
std     11.82   19617.65       10.05   14.93
min     22.00   30000.00        0.00   50.00
25%     31.00   58984.45        9.00   62.10
50%     42.00   73406.45       17.00   74.40
75%     52.00   86645.98       26.00   87.90
max     61.00  137030.61       34.00  100.00

2.2.1 Min-Max Normalisation

Scales every value to the range [0, 1].

\[x_{scaled} = \frac{x - x_{min}}{x_{max} - x_{min}}\]

Code
scaler_mm = MinMaxScaler()
df_minmax = pd.DataFrame(
    scaler_mm.fit_transform(df_emp),
    columns=df_emp.columns
)

print("After Min-Max Scaling:")
print(df_minmax.describe().round(3))
After Min-Max Scaling:
           Age   Salary  Experience    Score
count  300.000  300.000     300.000  300.000
mean     0.497    0.400       0.495    0.498
std      0.303    0.183       0.295    0.299
min      0.000    0.000       0.000    0.000
25%      0.231    0.271       0.265    0.242
50%      0.513    0.406       0.500    0.488
75%      0.769    0.529       0.765    0.758
max      1.000    1.000       1.000    1.000
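The scaler applies exactly the formula above. A quick sketch verifying this by hand on a single made-up salary column:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

salary = np.array([[30_000.0], [75_000.0], [150_000.0]])

scaled = MinMaxScaler().fit_transform(salary)
manual = (salary - salary.min()) / (salary.max() - salary.min())

print(np.allclose(scaled, manual))  # True
print(scaled.ravel())               # [0.    0.375 1.   ]
```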

2.2.2 Standardisation (Z-score Scaling)

Centres data at mean = 0 with standard deviation = 1.

\[x_{std} = \frac{x - \mu}{\sigma}\]

Code
scaler_std = StandardScaler()
df_std = pd.DataFrame(
    scaler_std.fit_transform(df_emp),
    columns=df_emp.columns
)

print("After Standardisation:")
print(df_std.describe().round(3))
After Standardisation:
           Age   Salary  Experience    Score
count  300.000  300.000     300.000  300.000
mean    -0.000    0.000       0.000   -0.000
std      1.002    1.002       1.002    1.002
min     -1.642   -2.187      -1.678   -1.672
25%     -0.880   -0.707      -0.781   -0.860
50%      0.053    0.029       0.017   -0.034
75%      0.900    0.705       0.914    0.872
max      1.662    3.278       1.712    1.684
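Note that the std row above reads 1.002 rather than exactly 1.000: describe() reports the sample standard deviation (ddof=1), while StandardScaler divides by the population standard deviation (ddof=0). A small sketch of the difference:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

x = np.array([[10.0], [20.0], [30.0], [40.0]])
z = StandardScaler().fit_transform(x)

print(z.std(ddof=0))  # 1.0 — population std, as the scaler guarantees
print(z.std(ddof=1))  # above 1.0 — the sample std that describe() reports
```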

2.2.3 Comparing Distributions

Code
fig, axes = plt.subplots(1, 3, figsize=(14, 4))

for ax, data, title in zip(axes,
                            [df_emp["Salary"],
                             df_minmax["Salary"],
                             df_std["Salary"]],
                            ["Raw Salary",
                             "Min-Max Scaled",
                             "Standardised"]):
    ax.hist(data, bins=30, color="steelblue", edgecolor="white")
    ax.set_title(title)
    ax.set_xlabel("Value")

plt.suptitle("Effect of Scaling on Salary Distribution", fontsize=13, y=1.02)
plt.tight_layout()
plt.show()

Key insight: Scaling changes the range of values but not the shape of the distribution.

2.2.4 When to Use Which?

Method            Use When
Min-Max           You need values in a fixed range [0, 1]; no extreme outliers
Standardisation   Algorithms sensitive to feature scale (e.g., logistic regression, SVM, k-NN)
No scaling        Tree-based models (Decision Trees, Random Forests)

2.2.5 Student Task 2.2

Using the df_emp dataset from above:

  1. Apply Min-Max scaling to only the Salary and Score columns (leave others unchanged).
  2. Apply Standardisation to Age and Experience.
  3. Print the mean and standard deviation of each scaled column to verify the transformations worked correctly.
  4. Explain in one sentence why scaling is important before training a k-nearest-neighbours model.
Code
# Your code here — use df_emp from the section above

2.2.6 Evaluation Questions 2.2

  1. After Min-Max scaling, what is the range of values?
    1. −1 to 1
    2. 0 to 100
    3. 0 to 1 (correct)
    4. −3 to 3
  2. After Standardisation, what is the approximate mean of each feature?
    1. 1
    2. 0.5
    3. 0 (correct)
    4. It depends on the data
  3. Which type of model generally does NOT require feature scaling?
    1. Logistic Regression
    2. Support Vector Machine
    3. K-Nearest Neighbours
    4. Decision Tree (correct)
  4. What is the formula for a z-score?
    1. \((x - x_{min}) / (x_{max} - x_{min})\)
    2. \((x - \mu) / \sigma\) (correct)
    3. \(x / x_{max}\)
    4. \((x - \sigma) / \mu\)
  5. Which sklearn class is used for Standardisation?
    1. MinMaxScaler
    2. Normalizer
    3. StandardScaler (correct)
    4. RobustScaler

2.3 Section 2.3 — Identifying Key Features

Feature selection identifies which variables (features) are most important for predicting an outcome. Fewer, more relevant features produce faster, more interpretable models.

Code
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

np.random.seed(123)
n = 500

df_cust = pd.DataFrame({
    "Age"             : np.random.randint(18, 70, n),
    "Income"          : np.random.normal(60_000, 25_000, n).clip(20_000, 200_000),
    "Tenure_Months"   : np.random.randint(1, 120, n),
    "Num_Products"    : np.random.randint(1, 8, n),
    "Web_Visits"      : np.random.randint(0, 50, n),
    "Complaints"      : np.random.poisson(0.5, n),
    "Satisfaction"    : np.random.uniform(1, 10, n).round(1),
})

# Target: will the customer churn? (influenced by satisfaction and complaints)
df_cust["Churned"] = (
    (df_cust["Satisfaction"] < 5) |
    (df_cust["Complaints"]   > 2)
).astype(int)

print("Dataset shape:", df_cust.shape)
print("Churn rate: {:.1%}".format(df_cust["Churned"].mean()))
Dataset shape: (500, 8)
Churn rate: 41.6%

2.3.1 Correlation Analysis

Code
corr_matrix = df_cust.corr()

fig, ax = plt.subplots(figsize=(9, 7))
mask = np.triu(np.ones_like(corr_matrix, dtype=bool))
sns.heatmap(corr_matrix, mask=mask, annot=True, fmt=".2f",
            cmap="coolwarm", center=0, ax=ax,
            square=True, linewidths=0.5)
ax.set_title("Feature Correlation Matrix", fontsize=14, pad=15)
plt.tight_layout()
plt.show()

Code
# Focus on correlation with the target variable
target_corr = corr_matrix["Churned"].drop("Churned").sort_values(key=abs, ascending=False)
print("Correlation with Churn (sorted by strength):")
print(target_corr.round(3).to_string())
Correlation with Churn (sorted by strength):
Satisfaction    -0.858
Tenure_Months    0.081
Complaints       0.066
Income           0.046
Age              0.034
Num_Products    -0.007
Web_Visits       0.000

2.3.2 Variance Analysis

Features with near-zero variance carry little information.

Code
feature_variance = df_cust.drop(columns="Churned").var().sort_values(ascending=False)
print("Feature Variance:")
print(feature_variance.round(2).to_string())
Feature Variance:
Income           5.925131e+08
Tenure_Months    1.262770e+03
Age              2.253900e+02
Web_Visits       2.027500e+02
Satisfaction     6.790000e+00
Num_Products     4.000000e+00
Complaints       4.600000e-01
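This filter can be automated with scikit-learn's `VarianceThreshold`. One caveat the output above illustrates: raw variance depends on units — Income dominates simply because it is measured in dollars — so compare variances only after scaling, or choose thresholds per feature. A toy sketch with made-up columns (names and the 0.01 cut-off are illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Toy frame: one informative column, one nearly constant column
df_toy = pd.DataFrame({
    "useful"   : np.arange(10.0),          # spreads from 0 to 9 — high variance
    "constant" : [1.0] * 9 + [1.0001],     # variance close to zero
})

selector = VarianceThreshold(threshold=0.01)   # drop near-zero-variance columns
selector.fit(df_toy)
kept = list(df_toy.columns[selector.get_support()])
print(kept)   # ['useful']
```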

2.3.3 Feature Importance via Random Forest

Code
from sklearn.ensemble import RandomForestClassifier

X = df_cust.drop(columns="Churned")
y = df_cust["Churned"]

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X, y)

importance_df = pd.DataFrame({
    "Feature"   : X.columns,
    "Importance": rf.feature_importances_
}).sort_values("Importance", ascending=False)

fig, ax = plt.subplots(figsize=(8, 4))
ax.barh(importance_df["Feature"], importance_df["Importance"], color="teal")
ax.set_xlabel("Importance Score")
ax.set_title("Feature Importance (Random Forest)")
ax.invert_yaxis()
plt.tight_layout()
plt.show()

Code
print("Top 3 most important features:")
print(importance_df.head(3).to_string(index=False))
Top 3 most important features:
      Feature  Importance
 Satisfaction    0.884742
Tenure_Months    0.025001
       Income    0.023793

2.3.4 Student Task 2.3

Using df_cust:

  1. Identify all features that have an absolute correlation > 0.3 with Churned.
  2. Create a bar chart showing the correlation of each feature with Churned.
  3. Based on the Random Forest importance plot, which two features would you prioritise for a churn prediction model? Justify your choice.
  4. What does it mean if a feature has a negative correlation with churn?
Code
# Your code here

2.3.5 Evaluation Questions 2.3

  1. A correlation of −0.75 between two variables indicates:
    1. No relationship
    2. A weak positive relationship
    3. A strong positive relationship
    4. A strong negative relationship (correct)
  2. Feature importance from a Random Forest measures:
    1. How large a feature’s values are
    2. How much each feature reduces prediction error (correct)
    3. The correlation between a feature and the target
    4. The number of unique values in a feature
  3. A feature with near-zero variance should likely be:
    1. Normalised before use
    2. Kept as the primary predictor
    3. Removed — it carries little information (correct)
    4. Multiplied by the target variable
  4. Why is feature selection important in business ML models?
    1. It always improves accuracy significantly
    2. It reduces model complexity and improves interpretability (correct)
    3. It automatically handles missing values
    4. It is required by all ML algorithms
  5. Which seaborn function creates a correlation heatmap?
    1. sns.corrplot()
    2. sns.matrix()
    3. sns.heatmap() (correct)
    4. sns.pairplot()

2.4 Section 2.4 — Data Visualisation for EDA

Visualisation transforms numbers into insights. We use Matplotlib for fine-grained control and Seaborn for attractive statistical charts.

Code
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_theme(style="whitegrid", palette="muted")

# Retail sales dataset
np.random.seed(99)
n = 400
df_retail = pd.DataFrame({
    "Month"        : np.random.choice(range(1, 13), n),
    "Category"     : np.random.choice(["Electronics","Apparel","Grocery","Home"], n),
    "Sales"        : np.random.lognormal(10, 0.6, n).round(2),
    "Discount_Pct" : np.random.uniform(0, 0.5, n).round(2),
    "Customer_Age" : np.random.randint(18, 70, n),
})

2.4.1 Histograms — Understand Distributions

Code
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

axes[0].hist(df_retail["Sales"], bins=40, color="steelblue", edgecolor="white")
axes[0].set_title("Distribution of Sales")
axes[0].set_xlabel("Sales ($)")
axes[0].set_ylabel("Frequency")

sns.histplot(df_retail["Customer_Age"], bins=25, kde=True,
             color="coral", ax=axes[1])
axes[1].set_title("Customer Age Distribution")
axes[1].set_xlabel("Age")

plt.tight_layout()
plt.show()

2.4.2 Box Plots — Spot Outliers and Compare Groups

Code
fig, ax = plt.subplots(figsize=(9, 5))
sns.boxplot(data=df_retail, x="Category", y="Sales", palette="Set2", ax=ax)
ax.set_title("Sales Distribution by Product Category")
ax.set_xlabel("Category")
ax.set_ylabel("Sales ($)")
plt.tight_layout()
plt.show()

2.4.3 Scatter Plots — Explore Relationships

Code
fig, ax = plt.subplots(figsize=(8, 5))
sns.scatterplot(data=df_retail, x="Discount_Pct", y="Sales",
                hue="Category", alpha=0.5, ax=ax)
ax.set_title("Sales vs Discount Percentage by Category")
ax.set_xlabel("Discount (%)")
ax.set_ylabel("Sales ($)")
plt.tight_layout()
plt.show()

2.4.4 Bar Charts — Compare Aggregates

Code
category_sales = df_retail.groupby("Category")["Sales"].sum().sort_values(ascending=False)

fig, ax = plt.subplots(figsize=(8, 4))
category_sales.plot(kind="bar", color="teal", edgecolor="white", ax=ax)
ax.set_title("Total Sales by Category")
ax.set_xlabel("Category")
ax.set_ylabel("Total Sales ($)")
ax.tick_params(axis="x", rotation=0)
plt.tight_layout()
plt.show()

2.4.5 Pair Plot — Multi-feature Overview

Code
sns.pairplot(df_retail[["Sales", "Discount_Pct", "Customer_Age"]],
             diag_kind="kde", plot_kws={"alpha": 0.3})
plt.suptitle("Pair Plot — Retail Dataset", y=1.01, fontsize=13)
plt.show()


2.4.6 Student Task 2.4

Using df_retail:

  1. Create a line chart showing average monthly sales (x-axis = Month, y-axis = average Sales). Does any seasonal pattern emerge?
  2. Create a box plot comparing the distribution of Discount_Pct across categories.
  3. Add a trend line to the scatter plot of Discount_Pct vs Sales using sns.regplot. What does the slope tell you about the relationship?
  4. Write three business insights you can draw from the visualisations.
Code
# Your code here

2.4.7 Evaluation Questions 2.4

  1. Which chart type best shows the distribution of a single continuous variable?
    1. Bar chart
    2. Scatter plot
    3. Histogram (correct)
    4. Pie chart
  2. Box plots are especially useful for:
    1. Showing time-series trends
    2. Comparing category proportions
    3. Identifying outliers and comparing group distributions (correct)
    4. Displaying correlation coefficients
  3. In a scatter plot, what does a positive slope indicate?
    1. As x increases, y decreases
    2. As x increases, y increases (correct)
    3. There is no relationship between x and y
    4. Both variables have the same scale
  4. What does the kde=True argument add to sns.histplot()?
    1. A key-density encryption layer
    2. A smooth probability density curve overlaid on the histogram (correct)
    3. An interactive zooming feature
    4. K-means clustering
  5. df.groupby("Category")["Sales"].mean() returns:
    1. A single overall average
    2. The average sales for each category (correct)
    3. The total sales per category
    4. The median sales for all rows

3 Module 3: Introduction to Machine Learning

Machine learning (ML) enables computers to learn patterns from data and make predictions without being explicitly programmed for every case.

3.1 Section 3.1 — ML Concepts and Workflow

3.1.1 What is Machine Learning?

Input Data  ──►  ML Algorithm  ──►  Trained Model  ──►  Predictions

Type           Description                   Business Example
Supervised     Learn from labelled examples  Predict customer churn (Yes/No)
Unsupervised   Find hidden patterns          Customer segmentation
Reinforcement  Learn through reward/penalty  Dynamic pricing engines

3.1.2 The ML Workflow

Code
# Step 1 — Load and inspect data
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, mean_squared_error

np.random.seed(42)
n = 600

df_loan = pd.DataFrame({
    "Income"       : np.random.normal(55_000, 20_000, n).clip(20_000, 150_000).round(-2),
    "Loan_Amount"  : np.random.normal(25_000, 10_000, n).clip(5_000, 80_000).round(-2),
    "Credit_Score" : np.random.randint(500, 850, n),
    "Age"          : np.random.randint(22, 65, n),
    "Employment_Yrs": np.random.randint(0, 30, n),
})

# Target: loan approved (1) or denied (0)
df_loan["Approved"] = (
    (df_loan["Credit_Score"] > 650) &
    (df_loan["Income"] > 40_000)
).astype(int)

print("Dataset shape:", df_loan.shape)
print("Approval rate: {:.1%}".format(df_loan["Approved"].mean()))
print(df_loan.head())
Dataset shape: (600, 6)
Approval rate: 40.3%
    Income  Loan_Amount  Credit_Score  Age  Employment_Yrs  Approved
0  64900.0      32600.0           672   60              12         1
1  52200.0      15800.0           546   60              21         0
2  68000.0      33700.0           640   56              20         0
3  85500.0      38600.0           617   22              26         0
4  50300.0      29100.0           776   45              22         1
Code
# Step 2 — Split into features (X) and target (y)
X = df_loan.drop(columns="Approved")
y = df_loan["Approved"]

# Step 3 — Train/test split (80 % train, 20 % test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Training samples : {len(X_train)}")
print(f"Testing samples  : {len(X_test)}")
print(f"Train approval rate: {y_train.mean():.2%}")
print(f"Test  approval rate: {y_test.mean():.2%}")
Training samples : 480
Testing samples  : 120
Train approval rate: 40.42%
Test  approval rate: 40.00%
Code
# Step 4 — Scale features
scaler = StandardScaler()
X_train_sc = scaler.fit_transform(X_train)   # fit on train, transform train
X_test_sc  = scaler.transform(X_test)        # transform test using train stats

Critical rule: Always fit the scaler on training data only, then apply it to both train and test sets. Fitting on test data would cause data leakage.
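A small sketch of why the rule matters — the deliberately drifted test set below is a contrived illustration:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X_train_demo = rng.normal(50, 10, (200, 1))   # training values (toy units)
X_test_demo  = rng.normal(80, 10, (50, 1))    # test data has drifted upward

scaler = StandardScaler().fit(X_train_demo)   # statistics come from train ONLY
X_test_scaled = scaler.transform(X_test_demo)

# The scaler memorised the training mean, not the combined mean
print(np.allclose(scaler.mean_, X_train_demo.mean(axis=0)))   # True

# Because the test data drifted, its scaled mean sits far from 0 —
# exactly the signal that fitting on train+test would have hidden
print(X_test_scaled.mean() > 2)   # True
```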

3.1.3 Evaluating Model Performance

Code
import matplotlib.pyplot as plt

# Illustrate the bias-variance trade-off concept
complexity = list(range(1, 11))
train_err  = [0.40, 0.28, 0.18, 0.10, 0.06, 0.04, 0.02, 0.01, 0.01, 0.01]
test_err   = [0.42, 0.30, 0.22, 0.16, 0.14, 0.15, 0.18, 0.23, 0.30, 0.38]

fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(complexity, train_err, "b-o", label="Training Error")
ax.plot(complexity, test_err,  "r-o", label="Test Error")
ax.axvline(x=5, color="green", linestyle="--", label="Optimal Complexity")
ax.set_xlabel("Model Complexity")
ax.set_ylabel("Error")
ax.set_title("Bias-Variance Trade-off")
ax.legend()
plt.tight_layout()
plt.show()


3.1.4 Student Task 3.1

Using the df_loan dataset:

  1. Check for missing values and confirm the data is clean.
  2. Show the class balance of the Approved column with a bar chart.
  3. Perform a train/test split (75% / 25%) and print the number of rows in each set.
  4. Apply StandardScaler to the training set. Verify the mean and standard deviation of the scaled training data.
  5. Why is it important to use stratify=y in the train/test split?
Code
# Your code here

3.1.5 Evaluation Questions 3.1

  1. Which step must come before fitting a scaler?
    1. Testing the model
    2. Splitting data into train and test sets (correct)
    3. Removing the target column from training data
    4. Encoding categorical variables
  2. Data leakage occurs when:
    1. You use too few training examples
    2. Test set information influences model training (correct)
    3. You apply too many scaling methods
    4. Your model has too many layers
  3. The test_size=0.2 argument means:
    1. 20 rows are used for testing
    2. 20 % of the data is held out for testing (correct)
    3. Testing runs 20 times
    4. 2 features are selected for testing
  4. In supervised learning, the label or target variable is:
    1. Any continuous feature
    2. The variable you are trying to predict (correct)
    3. A feature you must remove before training
    4. The first column of the DataFrame
  5. The bias-variance trade-off describes the balance between:
    1. Speed and accuracy
    2. Overfitting (high variance) and underfitting (high bias) (correct)
    3. Training data size and test data size
    4. Number of features and number of rows

3.2 Section 3.2 — Simple Regression Models

Regression predicts a continuous numerical outcome (e.g., revenue, house price, demand).

3.2.1 Linear Regression

\[\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_n x_n\]

Code
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

np.random.seed(5)
n = 400

# House price dataset
df_house = pd.DataFrame({
    "Size_sqft"  : np.random.randint(500, 4000, n),
    "Bedrooms"   : np.random.randint(1, 6, n),
    "Age_years"  : np.random.randint(0, 50, n),
    "Distance_km": np.random.uniform(1, 30, n).round(1),
})

# Price = true relationship + noise
df_house["Price"] = (
      250 * df_house["Size_sqft"]
    + 15_000 * df_house["Bedrooms"]
    - 2_000 * df_house["Age_years"]
    - 5_000 * df_house["Distance_km"]
    + 80_000
    + np.random.normal(0, 30_000, n)
).round(-2)

print("House dataset:")
print(df_house.describe().round(0))
House dataset:
       Size_sqft  Bedrooms  Age_years  Distance_km      Price
count      400.0     400.0      400.0        400.0      400.0
mean      2155.0       3.0       25.0         15.0   536693.0
std       1022.0       1.0       15.0          9.0   261585.0
min        505.0       1.0        0.0          1.0    41200.0
25%       1276.0       2.0       12.0          8.0   309700.0
50%       2184.0       3.0       26.0         16.0   533800.0
75%       3042.0       4.0       38.0         23.0   761400.0
max       3986.0       5.0       49.0         30.0  1079300.0
Code
X = df_house.drop(columns="Price")
y = df_house["Price"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_sc = scaler.fit_transform(X_train)
X_test_sc  = scaler.transform(X_test)

# Train the model
model_lr = LinearRegression()
model_lr.fit(X_train_sc, y_train)

# Predict
y_pred = model_lr.predict(X_test_sc)
Code
# Evaluate performance
mae  = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2   = r2_score(y_test, y_pred)

print(f"Mean Absolute Error  : ${mae:,.0f}")
print(f"Root Mean Sq. Error  : ${rmse:,.0f}")
print(f"R² Score             : {r2:.3f}")
Mean Absolute Error  : $25,195
Root Mean Sq. Error  : $31,777
R² Score             : 0.986
Code
# Visualise actual vs predicted
fig, ax = plt.subplots(figsize=(7, 5))
ax.scatter(y_test, y_pred, alpha=0.4, color="steelblue")
ax.plot([y_test.min(), y_test.max()],
        [y_test.min(), y_test.max()], "r--", label="Perfect prediction")
ax.set_xlabel("Actual Price ($)")
ax.set_ylabel("Predicted Price ($)")
ax.set_title("Linear Regression — Actual vs Predicted")
ax.legend()
plt.tight_layout()
plt.show()

Code
# Regression coefficients — feature impact
coef_df = pd.DataFrame({
    "Feature"    : X.columns,
    "Coefficient": model_lr.coef_
}).sort_values("Coefficient", ascending=False)

print("Feature Coefficients (standardised):")
print(coef_df.to_string(index=False))
print("\nInterpretation: larger absolute value = stronger influence on price.")
Feature Coefficients (standardised):
    Feature   Coefficient
  Size_sqft 252790.778125
   Bedrooms  23421.859369
  Age_years -28962.174490
Distance_km -43425.657642

Interpretation: larger absolute value = stronger influence on price.
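As a side note, a standardised coefficient can be converted back to original units by dividing by the feature's standard deviation, which the fitted scaler stores in `scaler.scale_`. A sketch with a single noise-free toy feature (the $250/sqft rule is invented for the demo):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
size_sqft = rng.uniform(500, 4000, (300, 1))   # single toy feature
price = 250 * size_sqft[:, 0] + 80_000         # exact $250/sqft rule, no noise

scaler = StandardScaler().fit(size_sqft)
model = LinearRegression().fit(scaler.transform(size_sqft), price)

# Dividing the standardised coefficient by the feature's std
# recovers the per-unit effect in original units ($ per sqft)
per_sqft = model.coef_[0] / scaler.scale_[0]
print(round(per_sqft))   # 250
```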

3.2.2 Interpreting Regression Metrics

Metric  Formula                          Interpretation
MAE     mean(|actual − predicted|)       Average error in the target's own units
RMSE    √mean((actual − predicted)²)     Penalises large errors more
R²      1 − SS_res / SS_tot              0 = no fit; 1 = perfect fit
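These formulas are easy to verify by hand with NumPy on a tiny made-up example:

```python
import numpy as np

actual    = np.array([100., 150., 200., 250.])
predicted = np.array([110., 140., 210., 230.])

mae  = np.mean(np.abs(actual - predicted))          # average absolute error
rmse = np.sqrt(np.mean((actual - predicted) ** 2))  # squares penalise big misses

ss_res = np.sum((actual - predicted) ** 2)          # residual sum of squares
ss_tot = np.sum((actual - actual.mean()) ** 2)      # total sum of squares
r2 = 1 - ss_res / ss_tot

print(mae)              # 12.5
print(round(rmse, 3))   # 13.229
print(round(r2, 3))     # 0.944
```

The same numbers come out of `mean_absolute_error`, `mean_squared_error`, and `r2_score` in sklearn.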

3.2.3 Student Task 3.2

A marketing team wants to predict next month’s advertising spend required to achieve a target sales volume.

  1. Create a synthetic dataset (100 rows) with features Ad_Budget, Season (encode as 1–4), and Competitors (integer), and a target Sales.
  2. Train a LinearRegression model and evaluate with MAE, RMSE, and R².
  3. Plot Actual vs Predicted sales.
  4. Which feature has the largest coefficient? What does that mean for the business?
Code
# Your code here

3.2.4 Evaluation Questions 3.2

  1. R² = 0.85 means the model explains what percentage of variance in the target?
    1. 15 %
    2. 85 % (correct)
    3. 8.5 %
    4. 0.85 %
  2. Which metric is most sensitive to large prediction errors?
    1. MAE
    2. RMSE (correct)
    3. Accuracy
    4. R²
  3. Linear regression assumes the relationship between features and target is:
    1. Exponential
    2. Linear (correct)
    3. Circular
    4. Random
  4. model.coef_ returns:
    1. The model’s accuracy score
    2. The number of training iterations
    3. The learned weights for each feature (correct)
    4. The predicted values
  5. If a regression coefficient for Ad_Budget is positive, it means:
    1. Higher ad spend predicts lower sales
    2. Higher ad spend predicts higher sales (correct)
    3. Ad budget is unrelated to sales
    4. Ad budget should be removed from the model

3.3 Section 3.3 — Simple Classification Models

Classification predicts a category (e.g., yes/no, tier A/B/C, fraud/not fraud).

3.3.1 Logistic Regression

Despite the name, logistic regression is a classification algorithm. It outputs the probability that an observation belongs to a class.
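Under the hood, the model's linear score is squashed through the sigmoid function to produce that probability. A minimal sketch:

```python
import numpy as np

def sigmoid(z):
    # Maps any real-valued score z to a probability in (0, 1)
    return 1 / (1 + np.exp(-z))

print(sigmoid(0))              # 0.5 — exactly on the decision boundary
print(round(sigmoid(2), 3))    # 0.881 — strong vote for the positive class
print(round(sigmoid(-2), 3))   # 0.119 — strong vote for the negative class
```

`predict_proba` in sklearn returns these probabilities; `predict` simply thresholds them at 0.5.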

Code
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (accuracy_score, classification_report,
                              confusion_matrix, ConfusionMatrixDisplay)

np.random.seed(42)
n = 800

# Customer churn dataset
df_churn = pd.DataFrame({
    "Tenure_Months" : np.random.randint(1, 72, n),
    "Monthly_Spend" : np.random.normal(70, 25, n).clip(20, 200).round(2),
    "Support_Calls" : np.random.poisson(1.5, n),
    "Satisfaction"  : np.random.uniform(1, 10, n).round(1),
    "Num_Products"  : np.random.randint(1, 6, n),
})

df_churn["Churned"] = (
    (df_churn["Satisfaction"] < 4.5) |
    (df_churn["Support_Calls"] > 4)
).astype(int)

X = df_churn.drop(columns="Churned")
y = df_churn["Churned"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)

scaler = StandardScaler()
X_train_sc = scaler.fit_transform(X_train)
X_test_sc  = scaler.transform(X_test)

print("Dataset shape:", df_churn.shape)
print("Churn rate: {:.1%}".format(y.mean()))
Dataset shape: (800, 6)
Churn rate: 40.1%
Code
# Train Logistic Regression
log_reg = LogisticRegression(random_state=42, max_iter=1000)
log_reg.fit(X_train_sc, y_train)
y_pred_lr = log_reg.predict(X_test_sc)

print("=== Logistic Regression ===")
print(f"Accuracy: {accuracy_score(y_test, y_pred_lr):.2%}")
print(classification_report(y_test, y_pred_lr, target_names=["Stayed","Churned"]))
=== Logistic Regression ===
Accuracy: 94.50%
              precision    recall  f1-score   support

      Stayed       0.95      0.96      0.95       120
     Churned       0.94      0.93      0.93        80

    accuracy                           0.94       200
   macro avg       0.94      0.94      0.94       200
weighted avg       0.94      0.94      0.94       200
Code
# Confusion matrix
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

for ax, model, name in zip(axes,
    [log_reg, KNeighborsClassifier(n_neighbors=7).fit(X_train_sc, y_train)],
    ["Logistic Regression", "K-Nearest Neighbours (k=7)"]):

    preds = model.predict(X_test_sc)
    cm    = confusion_matrix(y_test, preds)
    disp  = ConfusionMatrixDisplay(cm, display_labels=["Stayed","Churned"])
    disp.plot(ax=ax, colorbar=False, cmap="Blues")
    ax.set_title(f"{name}\nAccuracy: {accuracy_score(y_test, preds):.2%}")

plt.tight_layout()
plt.show()

3.3.2 Understanding Classification Metrics

Code
# Manual illustration
from sklearn.metrics import precision_score, recall_score, f1_score

y_pred_knn = KNeighborsClassifier(n_neighbors=7).fit(
    X_train_sc, y_train).predict(X_test_sc)

metrics_df = pd.DataFrame({
    "Model"     : ["Logistic Regression", "KNN (k=7)"],
    "Accuracy"  : [accuracy_score(y_test, y_pred_lr),
                   accuracy_score(y_test, y_pred_knn)],
    "Precision" : [precision_score(y_test, y_pred_lr),
                   precision_score(y_test, y_pred_knn)],
    "Recall"    : [recall_score(y_test, y_pred_lr),
                   recall_score(y_test, y_pred_knn)],
    "F1-Score"  : [f1_score(y_test, y_pred_lr),
                   f1_score(y_test, y_pred_knn)],
}).set_index("Model")

print(metrics_df.round(3))
                     Accuracy  Precision  Recall  F1-Score
Model                                                     
Logistic Regression     0.945      0.937   0.925     0.931
KNN (k=7)               0.920      0.932   0.862     0.896

Business context: In churn prediction, Recall (catching actual churners) is often more important than Precision — missing a churner is costlier than an unnecessary retention call.
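One practical lever for boosting recall: instead of the default 0.5 cut-off on the predicted probability, lower the threshold so more customers are flagged as churn risks, at the cost of extra false alarms. A self-contained sketch on synthetic data (the data-generating rule and the 0.3 threshold are illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

rng = np.random.default_rng(0)
n = 500
x = rng.normal(0, 1, (n, 1))
y = (x[:, 0] + rng.normal(0, 1, n) > 0.5).astype(int)   # noisy positive class

clf = LogisticRegression().fit(x, y)
proba = clf.predict_proba(x)[:, 1]          # P(positive) for each observation

recall_default = recall_score(y, (proba >= 0.5).astype(int))
recall_lowered = recall_score(y, (proba >= 0.3).astype(int))   # flag more positives

# Lowering the threshold can only add predicted positives, so recall never drops
print(recall_lowered >= recall_default)   # True
```

The price is lower precision — the right threshold depends on the relative cost of a missed churner versus an unnecessary retention call.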


3.3.3 Student Task 3.3

A bank wants to classify loan applications as Approved or Denied.

  1. Re-use the df_loan dataset from Section 3.1.
  2. Train both a LogisticRegression and a KNeighborsClassifier (k=5).
  3. Compare Accuracy, Precision, Recall, and F1-Score for both models.
  4. Plot confusion matrices for both models side by side.
  5. Which model would you recommend to the bank’s risk team and why?
Code
# Your code here
# Reload df_loan from Section 3.1 if needed

3.3.4 Evaluation Questions 3.3

  1. Logistic Regression outputs a:
    1. Continuous value like revenue
    2. Probability between 0 and 1 (correct)
    3. Cluster label
    4. Feature importance score
  2. Recall (sensitivity) measures:
    1. Of all predicted positives, how many are correct
    2. Of all actual positives, how many were correctly predicted (correct)
    3. The overall fraction of correct predictions
    4. The harmonic mean of precision and recall
  3. A confusion matrix shows:
    1. Which features are most confusing for the model
    2. True and false positives and negatives (correct)
    3. The correlation between features
    4. Model training time
  4. In which business scenario is high recall most critical?
    1. Recommending products to customers
    2. Predicting email open rates
    3. Detecting fraudulent transactions (correct)
    4. Forecasting annual revenue
  5. KNN classifies a new point by:
    1. Fitting a straight decision boundary
    2. Looking at the k closest training examples and taking a majority vote (correct)
    3. Building a tree of decision rules
    4. Computing the probability using a sigmoid function

3.4 Section 3.4 — Decision Trees and Random Forests

Decision Trees and Random Forests are among the most popular algorithms in business ML because they are interpretable, require no feature scaling, and capture non-linear relationships well.

3.4.1 Decision Trees

A Decision Tree splits data into groups by asking a series of yes/no questions.

Code
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score, classification_report

# Re-use churn dataset
np.random.seed(42)
n = 800
df_churn = pd.DataFrame({
    "Tenure_Months" : np.random.randint(1, 72, n),
    "Monthly_Spend" : np.random.normal(70, 25, n).clip(20, 200).round(2),
    "Support_Calls" : np.random.poisson(1.5, n),
    "Satisfaction"  : np.random.uniform(1, 10, n).round(1),
    "Num_Products"  : np.random.randint(1, 6, n),
})
df_churn["Churned"] = (
    (df_churn["Satisfaction"] < 4.5) |
    (df_churn["Support_Calls"] > 4)
).astype(int)

X = df_churn.drop(columns="Churned")
y = df_churn["Churned"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)
Code
# Train a shallow decision tree (max 3 levels — readable)
dt = DecisionTreeClassifier(max_depth=3, random_state=42)
dt.fit(X_train, y_train)

y_pred_dt = dt.predict(X_test)
print("Decision Tree Accuracy:", f"{accuracy_score(y_test, y_pred_dt):.2%}")
Decision Tree Accuracy: 100.00%

Note: the tree scores 100 % because the synthetic Churned target was generated by a simple threshold rule (Satisfaction < 4.5 or Support_Calls > 4), which a depth-3 tree can represent exactly. Real-world labels are never this clean.
Code
# Visualise the tree
fig, ax = plt.subplots(figsize=(16, 6))
plot_tree(dt,
          feature_names=X.columns,
          class_names=["Stayed", "Churned"],
          filled=True, rounded=True, fontsize=10, ax=ax)
ax.set_title("Decision Tree (max_depth=3)", fontsize=14)
plt.tight_layout()
plt.show()
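Besides plotting, the learned rules can be printed as plain text with `sklearn.tree.export_text`, which is handy for reports. A sketch on a one-feature toy dataset mimicking the Satisfaction split (the 4.5 cut-off echoes the synthetic rule above):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
X_toy = rng.uniform(1, 10, (200, 1))          # a single "Satisfaction" feature
y_toy = (X_toy[:, 0] < 4.5).astype(int)       # churn if satisfaction is low

tree = DecisionTreeClassifier(max_depth=2, random_state=42).fit(X_toy, y_toy)
rules = export_text(tree, feature_names=["Satisfaction"])
print(rules)   # indented if/else rules, one line per split
```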

3.4.2 Overfitting in Decision Trees

Code
depths     = range(1, 16)
train_acc  = []
test_acc   = []

for d in depths:
    clf = DecisionTreeClassifier(max_depth=d, random_state=42)
    clf.fit(X_train, y_train)
    train_acc.append(accuracy_score(y_train, clf.predict(X_train)))
    test_acc.append(accuracy_score(y_test,  clf.predict(X_test)))

fig, ax = plt.subplots(figsize=(9, 4))
ax.plot(depths, train_acc, "b-o", label="Train Accuracy")
ax.plot(depths, test_acc,  "r-o", label="Test Accuracy")
ax.set_xlabel("Max Tree Depth")
ax.set_ylabel("Accuracy")
ax.set_title("Decision Tree: Accuracy vs Depth")
ax.axvline(x=3, color="green", linestyle="--", label="Optimal depth ≈ 3")
ax.legend()
plt.tight_layout()
plt.show()

3.4.3 Random Forest — Ensemble Learning

A Random Forest builds many decision trees on random data subsets, then averages their predictions. This reduces variance without much bias increase.
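The variance-reduction effect can be simulated directly: averaging n independent noisy estimates shrinks variance by roughly a factor of n. (Real trees are correlated, so the gain is smaller, but still substantial.) A toy simulation, with all numbers invented for the demo:

```python
import numpy as np

rng = np.random.default_rng(7)
# Simulate 200 "trees": each gives a noisy estimate of the same quantity (0.4)
tree_preds = rng.normal(0.4, 0.1, size=(10_000, 200))

single_var   = tree_preds[:, 0].var()           # variance of one tree's estimate
ensemble_var = tree_preds.mean(axis=1).var()    # variance of the 200-tree average

# For independent trees the ratio is ~1/200, far below 1/100
print(ensemble_var < single_var / 100)   # True
```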

Code
rf = RandomForestClassifier(n_estimators=200, max_depth=6,
                             random_state=42, n_jobs=-1)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)

print("Random Forest Accuracy:", f"{accuracy_score(y_test, y_pred_rf):.2%}")
print(classification_report(y_test, y_pred_rf, target_names=["Stayed","Churned"]))
Random Forest Accuracy: 100.00%
              precision    recall  f1-score   support

      Stayed       1.00      1.00      1.00       120
     Churned       1.00      1.00      1.00        80

    accuracy                           1.00       200
   macro avg       1.00      1.00      1.00       200
weighted avg       1.00      1.00      1.00       200
Code
# Feature importance
imp_df = pd.DataFrame({
    "Feature"   : X.columns,
    "Importance": rf.feature_importances_
}).sort_values("Importance", ascending=True)

fig, ax = plt.subplots(figsize=(8, 4))
ax.barh(imp_df["Feature"], imp_df["Importance"], color="teal")
ax.set_xlabel("Importance")
ax.set_title("Random Forest Feature Importance")
plt.tight_layout()
plt.show()

Code
# Cross-validation for robust evaluation
cv_scores = cross_val_score(rf, X, y, cv=5, scoring="accuracy")
print(f"5-Fold CV Accuracy: {cv_scores.mean():.2%} ± {cv_scores.std():.2%}")
5-Fold CV Accuracy: 100.00% ± 0.00%

Again, the perfect score is an artefact of the rule-based synthetic target; expect lower, noisier scores on real data.

3.4.4 Student Task 3.4

  1. Train a DecisionTreeClassifier on the loan approval dataset (df_loan) with max_depth=4.
  2. Visualise the tree and identify the most important splitting feature at the root.
  3. Train a RandomForestClassifier with 100 trees and compare accuracy with the single tree.
  4. Plot feature importance from the Random Forest.
  5. Explain in plain language why a Random Forest generally outperforms a single Decision Tree.
Code
# Your code here

3.4.5 Evaluation Questions 3.4

  1. What is the purpose of max_depth in a Decision Tree?
    1. Limits the number of features used
    2. Limits how many levels deep the tree can grow, preventing overfitting (correct)
    3. Sets the number of trees in the forest
    4. Controls the learning rate
  2. A Random Forest improves over a single Decision Tree by:
    1. Using a more complex mathematical formula
    2. Averaging predictions of many trees trained on random subsets (correct)
    3. Using gradient descent
    4. Selecting only the most important features
  3. Feature importance in a Random Forest reflects:
    1. The correlation of each feature with the target
    2. How much each feature reduces impurity across all trees (correct)
    3. The number of times each feature appears in the data
    4. The p-value of each feature
  4. Cross-validation helps evaluate a model by:
    1. Training the model multiple times with different hyperparameters
    2. Testing the model on multiple different train/test splits (correct)
    3. Reducing the training dataset size
    4. Automatically tuning the number of trees
  5. Which statement about Decision Trees is TRUE?
    1. They always require feature scaling
    2. They cannot handle categorical variables
    3. A very deep tree typically overfits to training data (correct)
    4. They produce a linear decision boundary

4 Module 4: Business Applications of Machine Learning

This module connects ML methods to real business problems in three key domains: Marketing, Finance, and Operations.

4.1 Section 4.1 — ML in Marketing

Marketing generates rich customer data that ML can turn into competitive advantages: personalised offers, targeted campaigns, and churn prevention.

4.1.1 Customer Segmentation with K-Means

Code
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

np.random.seed(10)
n = 500

df_mkt = pd.DataFrame({
    "Recency_Days"   : np.random.randint(1, 365, n),    # days since last purchase
    "Frequency"      : np.random.randint(1, 30, n),      # purchases per year
    "Monetary_Value" : np.random.lognormal(7, 1, n).round(2),  # avg spend
})

scaler = StandardScaler()
X_sc   = scaler.fit_transform(df_mkt)
Code
# Elbow method — choose optimal number of clusters
inertia = []
K_range = range(1, 11)
for k in K_range:
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    km.fit(X_sc)
    inertia.append(km.inertia_)

fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(K_range, inertia, "bo-")
ax.set_xlabel("Number of Clusters (k)")
ax.set_ylabel("Inertia (Within-cluster Sum of Squares)")
ax.set_title("Elbow Method for Optimal k")
ax.axvline(x=4, color="red", linestyle="--", label="Elbow at k=4")
ax.legend()
plt.tight_layout()
plt.show()
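
A popular complement to the elbow method is the silhouette score, which measures how well each point fits its own cluster versus the nearest other cluster (values near 1 are better). A minimal sketch on a toy two-blob dataset (not the df_mkt data above):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(42)
# Two well-separated blobs, so k=2 should score highest
X_toy = np.vstack([rng.normal(0, 0.5, (50, 2)),
                   rng.normal(5, 0.5, (50, 2))])

for k in (2, 3, 4):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X_toy)
    print(f"k={k}: silhouette = {silhouette_score(X_toy, labels):.3f}")
```

Running the same loop over X_sc gives an independent check on the elbow plot's suggestion of k=4.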

Code
# Fit K-Means with k=4
km4 = KMeans(n_clusters=4, random_state=42, n_init=10)
df_mkt["Segment"] = km4.fit_predict(X_sc)

# Profile each segment
seg_profile = df_mkt.groupby("Segment").agg(
    Avg_Recency   = ("Recency_Days",   "mean"),
    Avg_Frequency = ("Frequency",      "mean"),
    Avg_Monetary  = ("Monetary_Value", "mean"),
    Count         = ("Recency_Days",   "count")
).round(1)

print("Customer Segment Profiles (RFM):")
print(seg_profile)
Customer Segment Profiles (RFM):
         Avg_Recency  Avg_Frequency  Avg_Monetary  Count
Segment                                                 
0               83.5            8.7        1363.4    145
1              177.8           23.3        1612.9    181
2              286.0            8.6        1508.3    153
3              138.1           14.1        9836.6     21
Code
# Label segments based on RFM profile
seg_labels = {
    seg_profile["Avg_Monetary"].idxmax()  : "Champions",
    seg_profile["Avg_Recency"].idxmax()   : "At-Risk",
}
# Attach the labels for reporting with df_mkt["Segment"].map(seg_labels)

fig, ax = plt.subplots(figsize=(8, 5))
scatter = ax.scatter(df_mkt["Recency_Days"], df_mkt["Monetary_Value"],
                     c=df_mkt["Segment"], cmap="tab10", alpha=0.5, s=20)
ax.set_xlabel("Recency (Days Since Last Purchase)")
ax.set_ylabel("Monetary Value ($)")
ax.set_title("Customer Segments — RFM Clustering")
plt.colorbar(scatter, ax=ax, label="Segment")
plt.tight_layout()
plt.show()

4.1.2 Churn Prediction Model

Code
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, roc_auc_score, RocCurveDisplay

np.random.seed(7)
n = 1000
df_churn2 = pd.DataFrame({
    "Recency_Days"    : np.random.randint(1, 365, n),
    "Frequency"       : np.random.randint(1, 50, n),
    "Avg_Spend"       : np.random.lognormal(4, 0.8, n).round(2),
    "Email_Opens"     : np.random.randint(0, 30, n),
    "NPS_Score"       : np.random.randint(1, 11, n),
})
df_churn2["Churned"] = (
    (df_churn2["Recency_Days"] > 180) & (df_churn2["NPS_Score"] < 6)
).astype(int)

X = df_churn2.drop(columns="Churned")
y = df_churn2["Churned"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                            random_state=42, stratify=y)
rf_mkt = RandomForestClassifier(n_estimators=200, random_state=42)
rf_mkt.fit(X_tr, y_tr)

y_prob = rf_mkt.predict_proba(X_te)[:, 1]
print(f"AUC-ROC: {roc_auc_score(y_te, y_prob):.3f}")

fig, ax = plt.subplots(figsize=(6, 5))
RocCurveDisplay.from_estimator(rf_mkt, X_te, y_te, ax=ax, name="RF Churn Model")
ax.set_title("ROC Curve — Churn Prediction")
plt.tight_layout()
plt.show()
AUC-ROC: 1.000

An AUC of exactly 1.0 should raise suspicion rather than celebration. It occurs here because the synthetic Churned label is a deterministic function of Recency_Days and NPS_Score, so the forest recovers the rule perfectly. On real data, a perfect score usually signals target leakage.


4.1.3 Student Task 4.1

Your marketing manager wants to design targeted email campaigns for different customer groups.

  1. Using df_mkt, try k = 3 and k = 5 clusters. Which do you prefer? Why?
  2. For each segment, write a brief marketing strategy (1–2 sentences) recommending how to engage that customer group.
  3. Using the churn model probabilities, create a DataFrame of the top 50 customers most likely to churn. What action would you recommend for each?
Code
# Your code here

4.1.4 Evaluation Questions 4.1

  1. RFM in customer analytics stands for:
    1. Revenue, Frequency, Market
    2. Recency, Frequency, Monetary (correct)
    3. Return, Function, Model
    4. Risk, Forecast, Margin
  2. The “elbow” in a K-Means elbow plot indicates:
    1. The maximum number of clusters allowed
    2. The point where adding more clusters yields diminishing returns (correct)
    3. An error in the data
    4. The optimal feature count
  3. K-Means is an example of which type of learning?
    1. Supervised learning
    2. Reinforcement learning
    3. Semi-supervised learning
    4. Unsupervised learning (correct)
  4. AUC-ROC of 0.5 indicates:
    1. Perfect classification
    2. 50 % accuracy
    3. Model is no better than random guessing (correct)
    4. 50 % of customers will churn
  5. A false negative in churn prediction means:
    1. Predicting a customer will churn when they will not
    2. Correctly identifying a loyal customer
    3. Predicting a customer will stay when they actually churn (correct)
    4. Correctly predicting a churner

4.2 Section 4.2 — ML in Finance

Finance teams use ML for credit scoring, fraud detection, portfolio management, and risk assessment.

4.2.1 Credit Scoring Model

Code
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, roc_auc_score, RocCurveDisplay
from sklearn.preprocessing import StandardScaler

np.random.seed(42)
n = 2000

df_credit = pd.DataFrame({
    "Age"              : np.random.randint(18, 70, n),
    "Annual_Income"    : np.random.normal(55_000, 20_000, n).clip(15_000, 200_000).round(-2),
    "Credit_History"   : np.random.randint(0, 20, n),         # years
    "Existing_Debt"    : np.random.normal(15_000, 10_000, n).clip(0, 80_000).round(-2),
    "Employment_Status": np.random.choice([0, 1], n, p=[0.2, 0.8]),  # 0=unemployed
    "Num_Loans"        : np.random.randint(0, 8, n),
})

# Default probability influenced by debt ratio and employment
debt_ratio       = df_credit["Existing_Debt"] / df_credit["Annual_Income"]
default_prob     = (0.3 * debt_ratio + 0.2 * (1 - df_credit["Employment_Status"])
                    + 0.1 * (df_credit["Num_Loans"] / 8)).clip(0, 1)
df_credit["Default"] = np.random.binomial(1, default_prob)

print("Default rate: {:.1%}".format(df_credit["Default"].mean()))

X = df_credit.drop(columns="Default")
y = df_credit["Default"]

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

scaler = StandardScaler()
X_tr_sc = scaler.fit_transform(X_tr)
X_te_sc = scaler.transform(X_te)

gb_model = GradientBoostingClassifier(
    n_estimators=200, max_depth=4, learning_rate=0.1, random_state=42)
gb_model.fit(X_tr_sc, y_tr)

y_prob_credit = gb_model.predict_proba(X_te_sc)[:, 1]
print(f"\nAUC-ROC: {roc_auc_score(y_te, y_prob_credit):.3f}")
print(classification_report(y_te, gb_model.predict(X_te_sc),
                             target_names=["No Default","Default"]))
Default rate: 18.4%

AUC-ROC: 0.672
              precision    recall  f1-score   support

  No Default       0.83      0.91      0.87       326
     Default       0.32      0.18      0.23        74

    accuracy                           0.78       400
   macro avg       0.57      0.54      0.55       400
weighted avg       0.74      0.78      0.75       400
Code
# Assign credit scores (higher = lower default risk)
all_prob = gb_model.predict_proba(scaler.transform(X))[:, 1]
df_credit["Credit_Score"] = (1000 * (1 - all_prob)).round(0).astype(int)

fig, axes = plt.subplots(1, 2, figsize=(13, 4))

axes[0].hist(df_credit["Credit_Score"], bins=40, color="steelblue", edgecolor="white")
axes[0].set_title("Predicted Credit Score Distribution")
axes[0].set_xlabel("Score")
axes[0].set_ylabel("Count")

RocCurveDisplay.from_estimator(gb_model, X_te_sc, y_te, ax=axes[1],
                                name="Gradient Boosting")
axes[1].set_title("ROC Curve — Default Prediction")

plt.tight_layout()
plt.show()
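
The 1000 * (1 - p) mapping above is ad hoc. Industry scorecards usually place scores on a log-odds scale with a fixed number of "points to double the odds" (PDO). A minimal sketch — the base_score, base_odds, and pdo values are illustrative assumptions, not a standard:

```python
import numpy as np

def prob_to_score(p, base_score=600, base_odds=50, pdo=20):
    """Map default probability p to a score where odds of 50:1
    against default give base_score, and each doubling of the
    odds against default adds pdo points."""
    odds   = (1 - p) / np.maximum(p, 1e-9)   # odds against default
    factor = pdo / np.log(2)
    offset = base_score - factor * np.log(base_odds)
    return offset + factor * np.log(odds)

print(f"p=0.02 -> score {prob_to_score(0.02):.0f}")
print(f"p=0.50 -> score {prob_to_score(0.50):.0f}")
```

Applied to all_prob, this gives scores on a scale that lenders can interpret consistently across models.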

4.2.2 Fraud Detection

Code
from sklearn.ensemble import IsolationForest

np.random.seed(55)
n_normal = 1900
n_fraud  = 100

df_fraud = pd.DataFrame({
    "Amount"    : np.concatenate([np.random.lognormal(4, 1, n_normal),
                                   np.random.uniform(5_000, 20_000, n_fraud)]),
    "Hour"      : np.concatenate([np.random.randint(6, 22, n_normal),
                                   np.random.randint(0, 6, n_fraud)]),
    "Merchant_Risk": np.concatenate([np.random.uniform(0, 0.3, n_normal),
                                       np.random.uniform(0.7, 1.0, n_fraud)]),
    "True_Fraud": [0] * n_normal + [1] * n_fraud
})

X_fraud = df_fraud[["Amount","Hour","Merchant_Risk"]]
iso     = IsolationForest(contamination=0.05, random_state=42)
iso.fit(X_fraud)
df_fraud["Predicted_Fraud"] = (iso.predict(X_fraud) == -1).astype(int)

tp = ((df_fraud["True_Fraud"] == 1) & (df_fraud["Predicted_Fraud"] == 1)).sum()
print(f"Fraud correctly flagged (Recall): {tp/n_fraud:.1%}")
Fraud correctly flagged (Recall): 97.0%
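
Recall alone can hide the cost of false alarms: with contamination=0.05 the model flags roughly 100 transactions, and any legitimate ones among them are false positives. Precision captures that side. A self-contained sketch with made-up flags and labels:

```python
import numpy as np

# Hypothetical ground truth and model flags for ten transactions
true_fraud = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])
flagged    = np.array([0, 0, 1, 0, 0, 0, 0, 1, 1, 0])

tp = ((true_fraud == 1) & (flagged == 1)).sum()
fp = ((true_fraud == 0) & (flagged == 1)).sum()
fn = ((true_fraud == 1) & (flagged == 0)).sum()

precision = tp / (tp + fp)   # of flagged cases, share truly fraudulent
recall    = tp / (tp + fn)   # of true frauds, share caught
print(f"Precision: {precision:.2f}  Recall: {recall:.2f}")
```

Computing both on df_fraud shows whether the 97% recall comes at the price of many wrongly blocked customers.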

4.2.3 Student Task 4.2

  1. Using df_credit, plot feature importance for the Gradient Boosting model. Which three factors most strongly predict loan default?
  2. Create a bar chart showing average default rate by number of existing loans (Num_Loans). What pattern emerges?
  3. What ethical concerns should a bank consider when using an ML credit scoring model? Write 2–3 sentences.
Code
# Your code here

4.2.4 Evaluation Questions 4.2

  1. In credit scoring, a high AUC-ROC score indicates:
    1. The model makes many false positives
    2. The model is good at distinguishing defaulters from non-defaulters (correct)
    3. The loan approval rate is high
    4. The model was trained on a large dataset
  2. Gradient Boosting builds models by:
    1. Training one large decision tree
    2. Averaging many independent trees in parallel
    3. Sequentially adding trees that correct previous errors (correct)
    4. Clustering customers before prediction
  3. An Isolation Forest detects anomalies by:
    1. Calculating the distance from cluster centroids
    2. Identifying points that are easy to isolate in fewer splits (correct)
    3. Using logistic regression probabilities
    4. Training on only fraudulent transactions
  4. Why is class imbalance a challenge in fraud detection?
    1. Fraud happens too frequently
    2. The model may learn to predict “no fraud” for all cases and still get high accuracy (correct)
    3. There are too many features in financial datasets
    4. Neural networks cannot detect fraud
  5. What does predict_proba() return for a binary classifier?
    1. The predicted class label (0 or 1)
    2. The feature importance array
    3. A probability for each class (correct)
    4. The confusion matrix

4.3 Section 4.3 — ML in Operations

Operations teams use ML for demand forecasting, supply chain optimisation, quality control, and predictive maintenance.

4.3.1 Demand Forecasting

Code
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.preprocessing import StandardScaler

np.random.seed(0)
dates      = pd.date_range("2020-01-01", periods=104, freq="W")  # 2 years of weekly data
trend      = np.linspace(500, 800, 104)
seasonality= 100 * np.sin(2 * np.pi * np.arange(104) / 52)
noise      = np.random.normal(0, 30, 104)
demand     = (trend + seasonality + noise).clip(0).round()

df_ops = pd.DataFrame({
    "Date"         : dates,
    "Demand"       : demand,
    "Week_of_Year" : dates.isocalendar().week.values,
    "Quarter"      : dates.quarter,
    "Promotion"    : np.random.choice([0, 1], 104, p=[0.75, 0.25]),
    "Price_Index"  : np.random.uniform(0.95, 1.05, 104).round(3),
})

fig, ax = plt.subplots(figsize=(12, 4))
ax.plot(df_ops["Date"], df_ops["Demand"], color="steelblue", linewidth=1.2)
ax.set_title("Weekly Product Demand (2 Years)")
ax.set_xlabel("Date")
ax.set_ylabel("Units Demanded")
plt.tight_layout()
plt.show()

Code
# Feature engineering for forecasting
X = df_ops[["Week_of_Year", "Quarter", "Promotion", "Price_Index"]]
y = df_ops["Demand"]

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, shuffle=False)      # no shuffle — time order matters!

rf_ops = RandomForestRegressor(n_estimators=200, random_state=42)
rf_ops.fit(X_tr, y_tr)
y_pred_ops = rf_ops.predict(X_te)

mae = mean_absolute_error(y_te, y_pred_ops)
r2  = r2_score(y_te, y_pred_ops)
print(f"Demand Forecast MAE : {mae:.1f} units")
print(f"Demand Forecast R²  : {r2:.3f}")
Demand Forecast MAE : 176.6 units
Demand Forecast R²  : -14.382

A negative R² means the model predicts worse than a constant forecast of the mean. The calendar features carry no information about the upward trend, and a Random Forest cannot extrapolate beyond the demand range it saw in training, so the later (higher-demand) test weeks are badly underpredicted.
Code
fig, ax = plt.subplots(figsize=(12, 4))
ax.plot(y_te.values, label="Actual Demand", color="steelblue")
ax.plot(y_pred_ops,  label="Forecast",      color="orange", linestyle="--")
ax.set_title("Demand Forecast vs Actual (Test Period)")
ax.set_xlabel("Week")
ax.set_ylabel("Units")
ax.legend()
plt.tight_layout()
plt.show()
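
Any forecasting model should beat a naive baseline such as "predict last week's demand". If it cannot — as the negative R² above suggests — the features need rework before the model does. A sketch of the naive benchmark, using hypothetical demand values:

```python
import numpy as np

# Hypothetical actual demand for five test weeks
y_actual = np.array([720.0, 740.0, 700.0, 760.0, 780.0])

# Naive forecast: repeat the previous week's observation
y_naive = np.roll(y_actual, 1)
y_naive[0] = y_actual[0]          # no earlier value for the first week

mae_naive = np.mean(np.abs(y_actual - y_naive))
print(f"Naive baseline MAE: {mae_naive:.1f} units")
```

Comparing this MAE against the model's MAE on the same weeks tells you whether the model adds any value at all.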

4.3.2 Predictive Maintenance

Code
np.random.seed(22)
n = 1000

# Machine sensor readings
df_maint = pd.DataFrame({
    "Temperature"   : np.random.normal(75, 10, n),
    "Vibration"     : np.random.normal(0.5, 0.1, n),
    "Operating_Hrs" : np.random.randint(100, 10_000, n),
    "Pressure"      : np.random.normal(100, 15, n),
})

# Failure more likely when temperature is high and vibration is high
failure_prob = (
    0.001 * df_maint["Temperature"] +
    0.3   * df_maint["Vibration"]   +
    0.00001 * df_maint["Operating_Hrs"] - 0.05
).clip(0, 1)

df_maint["Failure"] = np.random.binomial(1, failure_prob)
print(f"Failure rate: {df_maint['Failure'].mean():.2%}")

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

X_m = df_maint.drop(columns="Failure")
y_m = df_maint["Failure"]
X_mtr, X_mte, y_mtr, y_mte = train_test_split(X_m, y_m, test_size=0.2,
                                                random_state=42, stratify=y_m)

rf_maint = RandomForestClassifier(n_estimators=100, random_state=42)
rf_maint.fit(X_mtr, y_mtr)
print(classification_report(y_mte, rf_maint.predict(X_mte),
                             target_names=["OK","Failure"]))
Failure rate: 24.30%
              precision    recall  f1-score   support

          OK       0.76      0.96      0.85       151
     Failure       0.33      0.06      0.10        49

    accuracy                           0.74       200
   macro avg       0.55      0.51      0.48       200
weighted avg       0.65      0.74      0.67       200
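
The report shows recall of only 0.06 on failures, meaning almost every breakdown is missed. Since false negatives are the costly error here (see Question 4 below), a standard fix is to lower the decision threshold on predict_proba instead of using the default 0.5. A sketch with hypothetical probabilities:

```python
import numpy as np

# Hypothetical failure probabilities, e.g. from rf_maint.predict_proba(X_mte)[:, 1]
probs  = np.array([0.05, 0.20, 0.35, 0.45, 0.60, 0.80])
labels = np.array([0,    0,    1,    0,    1,    1])

for threshold in (0.5, 0.3):
    preds  = (probs >= threshold).astype(int)
    recall = ((preds == 1) & (labels == 1)).sum() / (labels == 1).sum()
    print(f"threshold={threshold}: recall on failures = {recall:.2f}")
```

Lowering the threshold trades more false alarms (extra inspections) for fewer missed failures; the right balance depends on the relative costs.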

4.3.3 Student Task 4.3

  1. Using df_ops, investigate whether promotions significantly increase demand. Calculate average demand with and without a promotion.
  2. Retrain the demand forecasting model adding a Lag_1_Demand feature (last week’s demand). Does R² improve?
  3. For the predictive maintenance model, plot feature importance. Which sensor reading is the strongest predictor of machine failure?
  4. Describe how a manufacturer could use this model to reduce downtime costs.
Code
# Your code here

4.3.4 Evaluation Questions 4.3

  1. Why should demand forecasting data not be shuffled before splitting train/test?
    1. Shuffling causes data loss
    2. The time sequence must be preserved so the model does not see future data (correct)
    3. Shuffling makes models slower to train
    4. Demand data is already sorted by default
  2. A lag feature (e.g., last week’s demand) is useful because:
    1. It reduces the training set size
    2. It captures temporal patterns and autocorrelation (correct)
    3. It replaces the need for trend features
    4. It removes seasonality from the data
  3. Predictive maintenance uses ML to:
    1. Automate equipment purchasing
    2. Identify which machines are most expensive
    3. Predict equipment failure before it occurs to schedule proactive maintenance (correct)
    4. Optimise the number of shifts per day
  4. What is the business benefit of reducing false negatives in a machine failure model?
    1. Fewer unnecessary maintenance interventions
    2. Lower probability of unexpected breakdowns and costly downtime (correct)
    3. Better accuracy on the training set
    4. Reduced energy consumption
  5. Which ML model type is most appropriate for predicting a continuous quantity like weekly demand?
    1. Logistic Regression
    2. K-Means Clustering
    3. Random Forest Regressor (correct)
    4. Isolation Forest

4.4 Section 4.4 — Building End-to-End ML Solutions

Delivering an ML model as a business solution requires more than good accuracy. This section covers pipelines, model persistence, and performance monitoring.

4.4.1 Sklearn Pipelines

A Pipeline chains preprocessing and modelling into a single reusable object.

Code
import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import classification_report
import warnings; warnings.filterwarnings("ignore")

np.random.seed(42)
n = 800

# Re-create loan dataset
df_loan_final = pd.DataFrame({
    "Income"         : np.random.normal(55_000, 20_000, n).clip(20_000, 150_000).round(-2),
    "Loan_Amount"    : np.random.normal(25_000, 10_000, n).clip(5_000, 80_000).round(-2),
    "Credit_Score"   : np.random.randint(500, 850, n),
    "Age"            : np.random.randint(22, 65, n),
    "Employment_Yrs" : np.random.randint(0, 30, n),
})
df_loan_final["Approved"] = (
    (df_loan_final["Credit_Score"] > 650) &
    (df_loan_final["Income"] > 40_000)
).astype(int)

X = df_loan_final.drop(columns="Approved")
y = df_loan_final["Approved"]

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# Build pipeline: scale → classify
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf",    RandomForestClassifier(n_estimators=100, random_state=42))
])

pipe.fit(X_tr, y_tr)
y_pred_pipe = pipe.predict(X_te)

from sklearn.metrics import accuracy_score
print(f"Pipeline Accuracy: {accuracy_score(y_te, y_pred_pipe):.2%}")
Pipeline Accuracy: 100.00%
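
A less obvious benefit: when the whole pipeline goes into cross-validation, the scaler is re-fit on each training fold only, so no information from the validation fold leaks into preprocessing. A minimal sketch on synthetic data (not the loan data above):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = (X_demo[:, 0] > 0).astype(int)   # label depends only on feature 0

pipe_cv = Pipeline([("scaler", StandardScaler()),
                    ("clf",    LogisticRegression())])

# StandardScaler is fit separately inside each of the 5 folds
scores = cross_val_score(pipe_cv, X_demo, y_demo, cv=5)
print(f"CV accuracy: {scores.mean():.2%}")
```

Scaling the full dataset before splitting, by contrast, quietly leaks test-fold statistics into training.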

4.4.2 Hyperparameter Tuning with GridSearchCV

Code
param_grid = {
    "clf__n_estimators" : [50, 100, 200],
    "clf__max_depth"    : [3, 5, None],
}

gs = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy", n_jobs=-1)
gs.fit(X_tr, y_tr)

print("Best parameters:", gs.best_params_)
print(f"Best CV Accuracy: {gs.best_score_:.2%}")
print(f"Test Accuracy   : {gs.score(X_te, y_te):.2%}")
Best parameters: {'clf__max_depth': 3, 'clf__n_estimators': 50}
Best CV Accuracy: 100.00%
Test Accuracy   : 100.00%

The perfect scores are an artefact of the synthetic data: Approved was generated from a simple deterministic rule on Credit_Score and Income, so even a shallow forest recovers it exactly. Expect nothing like this on real loan applications.

4.4.3 Saving and Loading Models

Code
import joblib

# Save the trained pipeline to disk
joblib.dump(gs.best_estimator_, "loan_approval_model.pkl")
print("Model saved to loan_approval_model.pkl")

# Load and use the model
loaded_model = joblib.load("loan_approval_model.pkl")

# Predict for a new applicant
new_applicant = pd.DataFrame([{
    "Income"         : 62_000,
    "Loan_Amount"    : 20_000,
    "Credit_Score"   : 710,
    "Age"            : 35,
    "Employment_Yrs" : 8
}])

prediction  = loaded_model.predict(new_applicant)[0]
probability = loaded_model.predict_proba(new_applicant)[0, 1]

outcome = "APPROVED" if prediction == 1 else "DENIED"
print(f"\nLoan Application Decision: {outcome}")
print(f"Approval Probability: {probability:.2%}")
Model saved to loan_approval_model.pkl

Loan Application Decision: APPROVED
Approval Probability: 85.30%

4.4.4 Model Monitoring Checklist

Once a model is deployed, track these indicators:

Indicator          What to Monitor                    Alert Threshold
Accuracy drift     Monthly accuracy vs baseline       Drop > 5 %
Data drift         Distribution shift in features     KS-test p < 0.05
Prediction drift   Change in predicted class ratio    > 10 % deviation
Business KPI       Revenue / churn / default rate     Defined by business
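
The KS test mentioned in the table can be run with scipy, which is available wherever scikit-learn is installed (scikit-learn depends on it). A sketch comparing a feature's training distribution against hypothetical live data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
train_income = rng.normal(55_000, 20_000, 1_000)   # distribution at training time
live_income  = rng.normal(62_000, 20_000, 1_000)   # shifted live data

stat, p_value = stats.ks_2samp(train_income, live_income)
if p_value < 0.05:
    print(f"Data drift detected (KS statistic = {stat:.3f}, p = {p_value:.2e})")
```

In production, this check would run on each feature at a regular cadence and feed the alert thresholds above.
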
Code
# Simulate model performance monitoring over time
import matplotlib.pyplot as plt   # not imported earlier in this section

np.random.seed(3)
months        = pd.date_range("2024-01", periods=12, freq="MS")
baseline_acc  = 0.87
acc_over_time = np.cumsum(np.random.normal(0, 0.01, 12)).clip(-0.15, 0)
monthly_acc   = (baseline_acc + acc_over_time).clip(0.5, 1)

fig, ax = plt.subplots(figsize=(10, 4))
ax.plot(months, monthly_acc, "b-o", label="Monthly Accuracy")
ax.axhline(y=baseline_acc,         color="green", linestyle="--", label="Baseline")
ax.axhline(y=baseline_acc - 0.05,  color="red",   linestyle="--", label="Alert threshold")
ax.fill_between(months, monthly_acc, baseline_acc - 0.05,
                where=(monthly_acc < baseline_acc - 0.05),
                alpha=0.3, color="red", label="Below threshold")
ax.set_title("Model Performance Monitoring — Monthly Accuracy")
ax.set_ylabel("Accuracy")
ax.legend()
plt.tight_layout()
plt.show()


4.4.5 Student Task 4.4

Build a complete end-to-end ML solution for the customer churn dataset:

  1. Create a Pipeline that includes StandardScaler and RandomForestClassifier.
  2. Use GridSearchCV to tune n_estimators (50, 100) and max_depth (3, 5, 10).
  3. Save the best model using joblib.
  4. Load the saved model and make a prediction for a new, unseen customer you define.
  5. Summarise the model in one paragraph as if you were presenting to a non-technical business manager.
Code
# Your code here

4.4.6 Evaluation Questions 4.4

  1. What is the primary benefit of using an sklearn Pipeline?
    1. It automatically improves model accuracy
    2. It chains preprocessing and modelling into one reproducible object (correct)
    3. It replaces the need for cross-validation
    4. It speeds up data loading
  2. In GridSearchCV, the parameter cv=5 means:
    1. Only 5 hyperparameter combinations are tested
    2. The model is evaluated using 5-fold cross-validation (correct)
    3. Training runs for 5 epochs
    4. The best 5 features are selected
  3. Data drift occurs when:
    1. The model’s code has a bug
    2. The distribution of input features changes over time (correct)
    3. The model is retrained too frequently
    4. The training data has too many rows
  4. Which joblib function saves a trained model to disk?
    1. joblib.save()
    2. joblib.export()
    3. joblib.store()
    4. joblib.dump() (correct)
  5. Why should a deployed ML model be retrained periodically?
    1. To increase the size of the training dataset automatically
    2. Because older models always have bugs that need fixing
    3. Because real-world data distributions change over time, causing model performance to degrade (correct)
    4. sklearn models expire after 12 months

5 Midterm Exam Preparation

The midterm covers Modules 1 and 2. Use the following practice problems to prepare.

5.1 Sample Practice Problems

5.1.1 Practice 1 — Python Fundamentals

Code
# Problem: Complete the function below
def categorise_customer(annual_spend, years_as_customer):
    """
    Return a customer tier based on:
    - Platinum : spend >= 50_000 OR tenure >= 10 years
    - Gold     : spend >= 20_000 OR tenure >= 5 years
    - Silver   : spend >= 5_000
    - Bronze   : all others
    """
    # YOUR CODE HERE
    pass

# Test cases
test_cases = [
    (60_000, 3),   # Platinum (spend)
    (15_000, 12),  # Platinum (tenure)
    (25_000, 4),   # Gold
    (7_000,  2),   # Silver
    (1_200,  1),   # Bronze
]

for spend, tenure in test_cases:
    tier = categorise_customer(spend, tenure)
    print(f"Spend=${spend:>7,}, Tenure={tenure:>2}y → {tier}")
Spend=$ 60,000, Tenure= 3y → None
Spend=$ 15,000, Tenure=12y → None
Spend=$ 25,000, Tenure= 4y → None
Spend=$  7,000, Tenure= 2y → None
Spend=$  1,200, Tenure= 1y → None
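
Every tier prints None until the function body is written. One possible solution (the conditions must be checked from the highest tier down, because the tiers overlap):

```python
def categorise_customer(annual_spend, years_as_customer):
    # Check the most exclusive tier first; elif guarantees exactly one tier
    if annual_spend >= 50_000 or years_as_customer >= 10:
        return "Platinum"
    elif annual_spend >= 20_000 or years_as_customer >= 5:
        return "Gold"
    elif annual_spend >= 5_000:
        return "Silver"
    else:
        return "Bronze"

print(categorise_customer(60_000, 3))    # Platinum (spend)
print(categorise_customer(15_000, 12))   # Platinum (tenure)
print(categorise_customer(1_200, 1))     # Bronze
```

Reversing the order of the checks (Bronze first) would misclassify every customer, since the lower-tier conditions are also true for higher-tier customers.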

5.1.2 Practice 2 — Data Cleaning

Code
# Messy dataset — clean it
import pandas as pd
import numpy as np

np.random.seed(77)
df_messy = pd.DataFrame({
    "product_id" : range(1, 51),
    "price"      : np.where(np.random.rand(50) < 0.10, np.nan,
                             np.random.uniform(10, 500, 50).round(2)),
    "category"   : np.where(np.random.rand(50) < 0.08, np.nan,
                             np.random.choice(["A","B","C","D"], 50)),
    "rating"     : np.where(np.random.rand(50) < 0.12, np.nan,
                             np.random.uniform(1, 5, 50).round(1)),
    "units_sold" : np.random.randint(0, 1000, 50),
})

print("Missing values:")
print(df_messy.isnull().sum())

# Clean the dataset
df_clean = df_messy.copy()
# Fill numeric missing values with median
df_clean["price"]  = df_clean["price"].fillna(df_clean["price"].median())
df_clean["rating"] = df_clean["rating"].fillna(df_clean["rating"].median())
# Fill categorical missing values with mode
df_clean["category"] = df_clean["category"].fillna(df_clean["category"].mode()[0])

print("\nAfter cleaning:")
print(df_clean.isnull().sum())
Missing values:
product_id    0
price         6
category      0
rating        3
units_sold    0
dtype: int64

After cleaning:
product_id    0
price         0
category      0
rating        0
units_sold    0
dtype: int64

5.1.3 Practice 3 — EDA

Code
import matplotlib.pyplot as plt

# Summary statistics and visualisation
print(df_clean.describe().round(2))

fig, axes = plt.subplots(1, 3, figsize=(14, 4))

# Distribution of price
axes[0].hist(df_clean["price"], bins=20, color="steelblue", edgecolor="white")
axes[0].set_title("Price Distribution")
axes[0].set_xlabel("Price ($)")

# Average rating by category
avg_rating = df_clean.groupby("category")["rating"].mean().sort_values(ascending=False)
axes[1].bar(avg_rating.index, avg_rating.values, color="coral", edgecolor="white")
axes[1].set_title("Average Rating by Category")
axes[1].set_xlabel("Category")
axes[1].set_ylabel("Rating")

# Scatter: price vs units sold
axes[2].scatter(df_clean["price"], df_clean["units_sold"],
                alpha=0.5, color="teal")
axes[2].set_title("Price vs Units Sold")
axes[2].set_xlabel("Price ($)")
axes[2].set_ylabel("Units Sold")

plt.tight_layout()
plt.show()
       product_id   price  rating  units_sold
count       50.00   50.00   50.00       50.00
mean        25.50  252.78    2.87      437.26
std         14.58  143.89    1.14      259.13
min          1.00   15.84    1.00        4.00
25%         13.25  139.36    1.90      230.00
50%         25.50  234.50    2.90      411.00
75%         37.75  370.86    3.88      655.50
max         50.00  499.33    4.90      988.00


6 Summary and Key Takeaways

Module                       Core Skills
1 — Python Fundamentals      Variables, conditionals, loops, data structures, NumPy, Pandas
2 — EDA                      Missing data, scaling, feature selection, visualisation
3 — Machine Learning         Workflow, regression, classification, trees, forests
4 — Business Applications    Segmentation, churn, credit, demand forecasting, deployment

6.1 Learning Path Forward

  1. Practice daily: Kaggle has free datasets and competitions.
  2. Apply to your domain: Every industry has data — find problems in your area.
  3. Communicate clearly: A model you cannot explain to a business audience has limited value.
  4. Stay ethical: Understand bias, fairness, and regulatory requirements (GDPR, Equal Credit Opportunity Act).

These lecture notes were produced using Quarto. Code examples use Python 3.10+ with scikit-learn 1.x, Pandas 2.x, and NumPy 1.x.